Adaptive Fuzzy String Matching: How to Merge Datasets with Only One (Messy) Identifying Field
A. Kaufman and Aja Klevs. Political Analysis 30(1): 590–596 (published October 11, 2021). doi:10.1017/pan.2021.38

Abstract: A single dataset is rarely sufficient to address a question of substantive interest; most applied data analysis instead combines data from multiple sources. Two datasets rarely contain the same identifiers with which to merge them: fields like name, address, and phone number may be entered incorrectly, missing, or in dissimilar formats. Combining multiple datasets absent a unique identifier that unambiguously connects entries is called the record linkage problem. While recent work has made great progress in the case where there are many possible fields on which to match, the much more uncertain case of only one identifying field remains unsolved. This fuzzy string matching problem, both a problem in its own right and a component of standard record linkage problems, is our focus. We design and validate an algorithmic solution called Adaptive Fuzzy String Matching, rooted in adaptive learning, and show that our tool identifies more matches, with higher precision, than existing solutions. Finally, we illustrate its validity and practical value through applications to matching organizations, places, and individuals.
Topics, Concepts, and Measurement: A Crowdsourced Procedure for Validating Topics as Measures
Luwei Ying, J. Montgomery, and Brandon M. Stewart. Political Analysis 30(1): 570–589 (published September 27, 2021). doi:10.1017/pan.2021.33

Abstract: Topic models, as developed in computer science, are effective tools for exploring and summarizing large document collections. When applied in social science research, however, they are commonly used for measurement, a task that requires careful validation to ensure that the model outputs actually capture the concept of interest. In this paper, we review current practices for topic validation in the field and show that extensive model validation is increasingly rare, or at least not systematically reported in papers and appendices. To supplement current practices, we refine an existing crowdsourcing method by Chang and coauthors for validating topic quality and go on to create new procedures for validating the conceptual labels provided by the researcher. We illustrate our method with an analysis of Facebook posts by U.S. Senators and provide software and guidance for researchers wishing to validate their own topic models. While tailored, case-specific validation exercises will always be best, we aim to improve standard practices by providing a general-purpose tool for validating topics as measures.
Racing the Clock: Using Response Time as a Proxy for Attentiveness on Self-Administered Surveys
Blair Read, L. Wolters, and Adam J. Berinsky. Political Analysis 30(1): 550–569 (published September 15, 2021). doi:10.1017/pan.2021.32

Abstract: Internet-based surveys have expanded public opinion data collection at the expense of monitoring respondent attentiveness, potentially compromising data quality. Researchers now have to evaluate attentiveness ex post. We propose a new proxy for attentiveness, response-time attentiveness clustering (RTAC), which uses dimension reduction and an unsupervised clustering algorithm to leverage variation in response time between respondents and across questions. We advance the literature theoretically, arguing that the existing dichotomous classification of respondents as fast or attentive is insufficient and neglects slow and inattentive respondents. We validate our theoretical classification and empirical strategy against commonly used proxies for survey attentiveness. In contrast to other methods for capturing attentiveness, RTAC allows researchers to collect attentiveness data unobtrusively without sacrificing space on the survey instrument.
Assessing Data Quality: An Approach and An Application
Kelly M. McMann, Daniel Pemstein, Brigitte Seim, Jan Teorell, and Staffan I. Lindberg. Political Analysis 30(1): 426–449 (published September 15, 2021). doi:10.1017/pan.2021.27

Abstract: Political scientists routinely face the challenge of assessing the quality (validity and reliability) of measures in order to use them in substantive research. While stand-alone assessment tools exist, researchers rarely combine them comprehensively. Further, while a large literature informs data producers, data consumers lack guidance on how to assess existing measures for use in substantive research. We delineate a three-component practical approach to data quality assessment that integrates complementary multimethod tools to assess: (1) content validity; (2) the validity and reliability of the data generation process; and (3) convergent validity. We apply our quality assessment approach to the corruption measures from the Varieties of Democracy (V-Dem) project, both illustrating our rubric and unearthing several quality advantages and disadvantages of the V-Dem measures compared to other existing measures of corruption.
What Makes Party Systems Different? A Principal Component Analysis of 17 Advanced Democracies 1970–2013
Zsuzsanna B. Magyar. Political Analysis 30(1): 250–268 (published August 17, 2021). doi:10.1017/pan.2021.21

Abstract: Party systems, that is, the number and size of all the parties within a country, can vary greatly across countries. I conduct a principal component analysis on a party seat share dataset of 17 advanced democracies from 1970 to 2013 to reduce the dimensionality of the data. I find that the most important dimensions differentiating party systems are "the size of the biggest two parties" and the level of "competition between the two biggest parties." I use the results to compare changes in electoral and legislative party systems, and I juxtapose the results with previous party system typologies and party system size measures. I find that typologies sort countries into categories based on variation along both dimensions, whereas most of the current political science literature uses measures (e.g., the effective number of parties) that are correlated with the first dimension. I suggest that, instead of these, indices that measure opposition structure and competition could be used to explore problems pertaining to the competitiveness of party systems.
Relatio: Text Semantics Capture Political and Economic Narratives
Elliott Ash, G. Gauthier, and Philine Widmer. Political Analysis, published online August 3, 2021. doi:10.1017/pan.2023.8

Abstract: Social scientists have become increasingly interested in how narratives (the stories in fiction, politics, and life) shape beliefs, behavior, and government policies. This paper provides an unsupervised method to quantify latent narrative structures in text documents. Our new software package, relatio, identifies coherent entity groups and maps explicit relations between them in the text. We provide an application to the U.S. Congressional Record to analyze political and economic narratives in recent decades. Our analysis highlights the dynamics, sentiment, polarization, and interconnectedness of narratives in political discourse.
Understanding Bayesianism: Fundamentals for Process Tracers
Andrew Bennett, A. Charman, and Tasha Fairfield. Political Analysis 30(1): 298–305 (published July 26, 2021). doi:10.1017/pan.2021.23

Abstract: Bayesian analysis has emerged as a rapidly expanding frontier in qualitative methods. Recent work in this journal has voiced various doubts regarding how to implement Bayesian process tracing and the costs versus benefits of this approach. In this response, we articulate a very different understanding of the state of the method and a much more positive view of what Bayesian reasoning can do to strengthen qualitative social science. Drawing on forthcoming research as well as our earlier work, we focus on clarifying issues involving mutual exclusivity of hypotheses, evidentiary import, adjudicating among more than two hypotheses, and the logic of iterative research, with the goal of elucidating how Bayesian analysis operates and pushing the field forward.
Return to the Scene of the Crime: Revisiting Process Tracing, Bayesianism, and Murder
Sherry Zaks. Political Analysis 30(1): 306–310 (published July 23, 2021). doi:10.1017/pan.2021.24

Abstract: In a recent article, I argued that the Bayesian process tracing literature exhibits a persistent disconnect between principle and practice. In their response, Bennett, Fairfield, and Charman raise important points and interesting questions about the method and its merits. This letter breaks from the ongoing point-by-point format of the debate by asking one question: in the most straightforward case, does the literature equip a reasonable scholar with the tools to conduct a rigorous analysis? I answer this question by walking through a qualitative Bayesian analysis of the simplest example: analyzing evidence of a murder. Along the way, I catalogue every question, complication, and pitfall I run into. Notwithstanding some important clarifications, I demonstrate that aspiring practitioners are still facing a method without guidelines or guardrails.