WordPPR: A Researcher-Driven Computational Keyword Selection Method for Text Data Retrieval from Digital Media
Authors: Yini Zhang, Fan Chen, Jiyoun Suk, Zhiying Yue
Journal: Communication Methods and Measures
Published: 2023-11-14
DOI: 10.1080/19312458.2023.2278177
Abstract:
Despite the increasing use of digital media data in communication research, a central challenge persists – retrieving data with maximal accuracy and coverage. Our investigation of keyword-based data collection practices in extant communication research reveals a one-step process, whereas our cross-disciplinary literature review suggests an iterative query expansion process guided by human knowledge and computer intelligence. Hence, we introduce the WordPPR method for keyword selection and text data retrieval, which entails four steps: 1) collecting an initial dataset using core/seed keyword(s); 2) constructing a word graph based on the dataset; 3) applying the Personalized PageRank (PPR) algorithm to rank words in proximity to the seed keyword(s) and selecting new keywords that optimize retrieval precision and recall; 4) repeating steps 1–3 to determine if additional data collection is needed. Without requiring corpus-wide sampling/analysis or extensive manual annotation, this method is well suited for data collection from large-scale digital media corpora. Our simulation studies demonstrate its robustness against parameter choice and its improvement upon other methods in suggesting additional keywords. Its application in Twitter data retrieval is also provided. By advancing a more systematic approach to text data retrieval, this study contributes to improving digital media data retrieval practices in communication research and beyond.

Acknowledgement:
We thank our reviewers, the editors, Dr. Karl Rohe, Dr. Nojin Kwak, and Dr. Dhavan Shah for their helpful feedback. We also thank Rui Wang, Dongdong Yang, and Xinxia Dong for assistance with the journal article coding.

Disclosure statement:
No potential conflict of interest was reported by the author(s).

Data availability statement:
The method and application code files as well as the supplementary materials are available at https://osf.io/pcybz/.

Supplementary material:
Supplemental data for this article can be accessed online at https://doi.org/10.1080/19312458.2023.2278177.

Notes on contributors:
Yini Zhang (Ph.D., University of Wisconsin–Madison) is an assistant professor in the Department of Communication at the University at Buffalo, State University of New York. She studies social media, media ecosystem, and political communication, using computational methods.
Fan Chen (Ph.D., University of Wisconsin–Madison) is a Data Scientist at Google. He studies and develops statistical methods for social media, genomics, and advertisement data. The bulk of this work was completed while he was a Ph.D. student at the University of Wisconsin–Madison.
Jiyoun Suk (Ph.D., University of Wisconsin–Madison) is an assistant professor in the Department of Communication at the University of Connecticut. She studies the role of networked communication in shaping social trust, activism, and polarization, using computational methods.
Zhiying Yue (Ph.D., University at Buffalo) is a postdoctoral researcher at the Digital Wellness Lab, Boston Children’s Hospital, and Harvard Medical School. Her research interests generally focus on individuals’ social media use and psychological well-being.
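The keyword-ranking idea in steps 2 and 3 of the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (their code is at the OSF link above). It assumes a simple graph in which two words are linked whenever they appear in the same document, a plain power-iteration Personalized PageRank, and a toy corpus. The function names, the teleport probability `alpha=0.15`, and the example documents are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_word_graph(docs):
    """Undirected co-occurrence graph (an assumed construction): the edge
    weight is the number of documents in which two distinct words co-occur."""
    graph = defaultdict(lambda: defaultdict(float))
    for tokens in docs:
        for a, b in combinations(sorted(set(tokens)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def personalized_pagerank(graph, seeds, alpha=0.15, iters=100):
    """Power iteration for PPR: with probability `alpha` the random walk
    teleports back to a seed word, otherwise it follows a weighted edge."""
    seeds = [s for s in seeds if s in graph]
    if not seeds:
        raise ValueError("no seed word occurs in the graph")
    restart = {w: (1.0 / len(seeds) if w in seeds else 0.0) for w in graph}
    score = dict(restart)
    for _ in range(iters):
        nxt = {w: alpha * restart[w] for w in graph}
        for w, nbrs in graph.items():
            total = sum(nbrs.values())
            for v, weight in nbrs.items():
                nxt[v] += (1.0 - alpha) * score[w] * weight / total
        score = nxt
    return score


def suggest_keywords(docs, seeds, top_k=3):
    """Rank non-seed words by PPR score and return the top candidates."""
    graph = build_word_graph(docs)
    score = personalized_pagerank(graph, seeds)
    candidates = [w for w in graph if w not in seeds]
    return sorted(candidates, key=lambda w: -score[w])[:top_k]


# Toy corpus standing in for an initial dataset collected with the seed keyword.
docs = [
    ["vaccine", "covid", "dose"],
    ["vaccine", "booster", "dose"],
    ["vaccine", "covid", "booster"],
    ["football", "game", "score"],
    ["football", "game", "covid"],
]
seeds = ["vaccine"]
suggestions = suggest_keywords(docs, seeds)
print(suggestions)
```

The ranked list is only a set of candidates. In the workflow the abstract describes, the researcher still decides which suggested words to adopt, weighing retrieval precision against recall, before collecting more data and repeating the process.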
About the journal:
Communication Methods and Measures aims to achieve several goals in the field of communication research. Firstly, it aims to bring attention to and showcase developments in both qualitative and quantitative research methodologies to communication scholars. This journal serves as a platform for researchers across the field to discuss and disseminate methodological tools and approaches.
Additionally, Communication Methods and Measures offers suggestions for improving research design and analysis practices. It aims to introduce new methods of measurement that are valuable to communication scientists or enhance existing methods. The journal encourages submissions that focus on methods for enhancing research design and theory testing, employing both quantitative and qualitative approaches.
Furthermore, the journal is open to articles devoted to exploring the epistemological aspects relevant to communication research methodologies. It welcomes well-written manuscripts that demonstrate the use of methods and articles that highlight the advantages of lesser-known or newer methods over those traditionally used in communication.
In summary, Communication Methods and Measures strives to advance the field of communication research by showcasing and discussing innovative methodologies, improving research practices, and introducing new measurement methods.