Editorial: Large-Scale Spatial Data Science
Sameh Abdulah, S. Castruccio, M. Genton, Ying Sun
Journal of Data Science, 2022. doi:10.6339/22-jds204edi

This special issue features eight articles on “Large-Scale Spatial Data Science.” Data science for complex and large-scale spatial and spatio-temporal data has become essential in many research fields, such as climate science and environmental applications. As the amounts of data collected keep growing, traditional statistical approaches tend to break down, and computationally efficient methods and scalable algorithms suitable for large-scale spatial data have become crucial for coping with the many challenges associated with big data. This special issue aims to highlight some of the latest developments in large-scale spatial data science. The research papers presented showcase advanced statistical methods and machine learning approaches for solving complex and large-scale problems arising in modern data science applications. Abdulah et al. (2022) report the results of the second competition on spatial statistics for large datasets, organized by the King Abdullah University of Science and Technology (KAUST). Very large datasets (up to 1 million observations) were generated with the ExaGeoStat software to design a competition on large-scale prediction in challenging settings, including univariate nonstationary spatial processes, univariate stationary space-time processes, and bivariate stationary spatial processes. The authors describe the data generation process in detail for each setting and have made these valuable datasets publicly available. They review the methods used by fourteen competing teams worldwide, analyze the results of the competition, and assess the performance of each team.
Data Science Applications and Implications in Legal Studies: A Perspective Through Topic Modelling
Jinzhe Tan, Huan Wan, Ping Yan, Zheng Hua Zhu
Journal of Data Science, 2022. doi:10.6339/22-jds1058

Law and legal studies have become an exciting new field for data science applications, while technological advances also have profound implications for legal practice. For example, the legal industry has accumulated a rich body of high-quality texts, images, and other digitised formats that are ready to be further processed and analysed by data scientists. On the other hand, the increasing popularity of data science poses a genuine challenge to legal practitioners, regulators, and even the general public, and has motivated a long-running debate in academia on issues such as privacy protection and algorithmic discrimination. This paper collects 1236 journal articles involving both law and data science from the Web of Science platform to understand the patterns and trends of this interdisciplinary research field in terms of English journal publications. We find a clear trend of increasing publication volume over time and a strong presence of high-impact law and political science journals. We then use Latent Dirichlet Allocation (LDA) as a topic modelling method to classify the abstracts into four topics, with the number of topics chosen by a coherence measure. The four topics identified confirm that both challenges and opportunities have been investigated in this interdisciplinary field and help point to directions for future research.
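The LDA workflow described above can be sketched in a few lines. A minimal illustration with scikit-learn, where the toy corpus stands in for the 1236 abstracts and the topic count is fixed by hand (in the paper it is selected with a coherence measure, computed separately, e.g. with gensim):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy documents standing in for the collected abstracts (illustrative only).
docs = [
    "privacy protection and data regulation in legal practice",
    "algorithmic discrimination and fairness in automated decisions",
    "machine learning applied to court judgment texts",
    "data governance privacy law and regulatory compliance",
]

# Bag-of-words representation of the documents.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit LDA; each row of doc_topics is a per-document topic mixture summing to 1,
# so a document can be classified by its dominant topic.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)
```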
Creating a Census County Assessment Tool for Visualizing Census Data
Izzy Youngs, R. Prevost, Christopher Dick
Journal of Data Science, 2022. doi:10.6339/22-jds1082

The 2020 Census County Assessment Tool was developed to help decennial census data users identify deviations between expected census counts and the released counts across population and housing indicators. The tool also offers contextual data for each county on factors that could have contributed to census collection issues, such as self-response rates and COVID-19 infection rates. The tool compiles this information into a downloadable report and points users to additional local data sources relevant to the data collection process, as well as to experts who can offer further assistance.
Exploring Rural Shrink Smart Through Guided Discovery Dashboards
Denise Bradford, Susan Vanderplas
Journal of Data Science, 2022. doi:10.6339/22-jds1080

Many small and rural places are shrinking. Interactive dashboards are among the most common tools for data visualization and for providing context in exploratory data analysis. In this paper, we use Iowa data to explore how dashboards can be used in small and rural areas to empower novice analysts to make data-driven decisions. Our framework suggests a number of research directions for interactive dashboard design, implementation, and use by the everyday analyst that could better support small and rural places facing decline.
Improving the Science of Annotation for Natural Language Processing: The Use of the Single-Case Study for Piloting Annotation Projects
Kylie L. Anglin, Arielle Boguslav, Todd Hall
Journal of Data Science, 2022. doi:10.6339/22-jds1054

Researchers need guidance on how to obtain maximum efficiency and accuracy when annotating training data for text classification applications. Further, given wide variability in the kinds of annotations researchers need to obtain, they would benefit from the ability to conduct low-cost experiments during the design phase of annotation projects. To this end, our study proposes the single-case study design as a feasible and causally valid experimental design for determining the best procedures for a given annotation task. The key strength of the design is its ability to generate causal evidence at the individual level, identifying the impact of competing annotation techniques and interfaces for the specific annotator(s) included in an annotation project. In this paper, we demonstrate the application of the single-case study in an applied experiment and argue that future researchers should incorporate the design into the pilot stage of annotation projects so that, over time, a causally valid body of knowledge regarding the best annotation techniques is built.
Motion Picture Editing as a Hawkes Process
Nick Redfern
Journal of Data Science, 2022. doi:10.6339/22-jds1055

In this article I analyse motion picture editing as a point process to explore the temporal structure in the timings of cuts in motion pictures, modelling the editing in 134 Hollywood films released between 1935 and 2005 as a Hawkes process with an exponential kernel. The results show that the editing in Hollywood films can be modelled as a Hawkes process and that the conditional intensity function provides a direct description of the instantaneous cutting rate of a film, revealing the structure of a film’s editing at a range of scales. The parameters of the exponential kernel show a clear trend over time towards a more rapid editing style, with an increase in the rate of exogenous events and a small increase in the rate of endogenous events. This is consistent with the shift from a classical to an intensified continuity editing style. There are, however, few differences between genres, indicating the consistency of editing practices in Hollywood cinema over time and across different types of films.
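A Hawkes process with an exponential kernel has conditional intensity λ(t) = μ + α Σ_{t_i < t} exp(−β (t − t_i)), where μ is the exogenous (baseline) cutting rate and α, β govern the endogenous self-excitation from earlier cuts. A minimal sketch of evaluating this intensity over a film's cut times (the parameter values and cut times are illustrative, not estimates from the paper):

```python
import math

def hawkes_intensity(t, cut_times, mu=0.1, alpha=0.5, beta=1.0):
    """Conditional intensity of a Hawkes process with exponential kernel:
    lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i))."""
    excitation = sum(math.exp(-beta * (t - ti)) for ti in cut_times if ti < t)
    return mu + alpha * excitation

# Instantaneous cutting rate just after a burst of three quick cuts:
# each recent cut raises the intensity, which then decays at rate beta.
cuts = [10.0, 10.8, 11.5]  # cut times in seconds (illustrative)
rate = hawkes_intensity(12.0, cuts)
```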
Modeling Dynamic Transport Network with Matrix Factor Models: an Application to International Trade Flow
Elynn Y. Chen, Rong Chen
Journal of Data Science, 2022. doi:10.6339/22-jds1065

International trade research plays an important role in informing trade policy and shedding light on wider economic issues. With recent advances in information technology, economic agencies distribute enormous amounts of internationally comparable trade data, providing a gold mine for empirical analysis of international trade. International trade data can be viewed as a dynamic transport network because they emphasize the amount of goods moving across network edges. Most of the literature on dynamic network analysis concentrates on parametric modeling of the connectivity network, focusing on link formation or deformation rather than on the transport moving across the network. We take a non-parametric perspective that differs from the pervasive node-and-edge-level modeling: the dynamic transport network is modeled as a time series of relational matrices, and variants of the matrix factor model of Wang et al. (2019) are applied to provide a specific interpretation for the dynamic transport network. Under the model, the observed surface network is assumed to be driven by a lower-dimensional latent dynamic transport network. Our method is able to unveil the latent dynamic structure and achieves the goal of dimension reduction. We applied the proposed method to a dataset of monthly trading volumes among 24 countries (and regions) from 1982 to 2015. Our findings shed light on trading hubs, centrality, trends, and patterns of international trade, and reveal change points matching trade policies. The dataset also provides fertile ground for future research on international trade.
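In a matrix factor model, each observed relational matrix X_t is driven by a lower-dimensional latent factor matrix F_t through row and column loading matrices, X_t = R F_t C' + E_t. A minimal sketch of the dimension-reduction idea on simulated data, estimating the loading spaces from averaged second moments (a simplification for illustration; Wang et al. (2019) estimate the loading spaces from auto-cross-covariances, not this plain average):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate T relational matrices driven by a low-dimensional latent factor:
# X_t = R @ F_t @ C.T + noise, with 24 nodes and a 3x3 latent factor.
T, n, k = 200, 24, 3
R = rng.standard_normal((n, k))
C = rng.standard_normal((n, k))
X = np.stack([
    R @ rng.standard_normal((k, k)) @ C.T + 0.1 * rng.standard_normal((n, n))
    for _ in range(T)
])

# Estimate the row loading space from the top eigenvectors of the
# averaged X_t X_t' (and the column space from the averaged X_t' X_t).
M_row = sum(Xt @ Xt.T for Xt in X) / T
M_col = sum(Xt.T @ Xt for Xt in X) / T
R_hat = np.linalg.eigh(M_row)[1][:, -k:]   # top-k eigenvectors
C_hat = np.linalg.eigh(M_col)[1][:, -k:]

# Project each observed matrix onto the estimated loading spaces to
# recover a k x k latent factor series: dimension reduction 24x24 -> 3x3.
F_hat = np.stack([R_hat.T @ Xt @ C_hat for Xt in X])
```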
Flowthrough Centrality: A Stable Node Centrality Measure
Charles F. Mann, M. McGee, E. Olinick, D. Matula
Journal of Data Science, 2022. doi:10.6339/22-jds1081

This paper introduces flowthrough centrality, a node centrality measure determined from the hierarchical maximum concurrent flow problem (HMCFP). Reflecting the extent to which a node acts as a hub within a network, this centrality measure is defined as the ratio of the flow passing through the node to the total flow capacity of the node. Flowthrough centrality is compared to the commonly used closeness, betweenness, and flow betweenness centralities, as well as to stable betweenness centrality, to measure the stability (i.e., accuracy) of the centralities when knowledge of the network topology is incomplete or in transition. Perturbations do not alter the flowthrough centrality values of nodes, which are based upon flow, as much as they do other centrality values that are based upon geodesics. The flowthrough centrality measure overcomes the problem of overstating or understating the roles that significant actors play in social networks. Flowthrough centrality is canonical in that it is determined from a natural, realized flow universally applicable to all networks.
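The baseline centralities that flowthrough centrality is compared against are readily computed. A minimal sketch with networkx (flowthrough centrality itself requires solving the HMCFP with a linear-programming solver and is not shown; the small graph is illustrative, and current-flow betweenness stands in here as a flow-based alternative to geodesic betweenness):

```python
import networkx as nx

# Two triangles joined by a bridge: nodes 2 and 3 are the obvious brokers.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])

closeness = nx.closeness_centrality(G)
betweenness = nx.betweenness_centrality(G)                    # geodesic-based
flow_betweenness = nx.current_flow_betweenness_centrality(G)  # flow-based

# Rank nodes by geodesic betweenness; the two bridge endpoints come out on top.
ranked = sorted(betweenness, key=betweenness.get, reverse=True)
```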
Inference for Optimal Differential Privacy Procedures for Frequency Tables
Chengcheng Li, Na Wang, Gongjun Xu
Journal of Data Science, 2022. doi:10.6339/22-jds1044

When releasing data to the public, a vital concern is the risk of exposing personal information of the individuals who have contributed to the data set. Many mechanisms have been proposed to protect individual privacy, though less attention has been dedicated to practically conducting valid inference on the altered privacy-protected data sets. For frequency tables, the privacy-protection-oriented perturbations often lead to negative cell counts. Releasing such tables can undermine users’ confidence in the usefulness of such data sets. This paper focuses on releasing one-way frequency tables. We recommend an optimal mechanism that satisfies ϵ-differential privacy (DP) without suffering from negative cell counts. The procedure is optimal in the sense that the expected utility is maximized under a given privacy constraint. Valid inference procedures for testing goodness-of-fit are also developed for the DP privacy-protected data. In particular, we propose a de-biased test statistic for the optimal procedure and derive its asymptotic distribution. In addition, we introduce testing procedures for the commonly used Laplace and Gaussian mechanisms, which provide a good finite-sample approximation for the null distributions. Moreover, the decay-rate requirements on the privacy regime are provided for the inference procedures to be valid. We further consider common user practices, such as merging related or neighboring cells or integrating statistical information obtained across different data sources, and derive valid testing procedures when these operations occur. Simulation studies show that our inference results hold well even when the sample size is relatively small. Comparisons with the current field standards, including the Laplace, the Gaussian (both with and without post-processing of replacing negative cell counts with zeros), and the Binomial-Beta McClure-Reiter mechanisms, are carried out. Finally, we apply our method to the National Center for Early Development and Learning’s (NCEDL) multi-state studies data to demonstrate its practical applicability.
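The Laplace mechanism used as a baseline adds independent Laplace(1/ϵ) noise to each cell of a one-way frequency table (the L1 sensitivity of such a table is 1 under add/remove-one-record neighbors), and it is exactly this perturbation that can produce the negative cell counts the paper's optimal mechanism avoids. A minimal sketch, with an illustrative table and ϵ:

```python
import numpy as np

def laplace_mechanism(counts, epsilon, seed=0):
    """epsilon-DP release of a one-way frequency table: each cell gets
    independent Laplace noise with scale = sensitivity / epsilon = 1 / epsilon."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    return counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)

true_counts = [120, 45, 8, 2]          # illustrative one-way table
noisy = laplace_mechanism(true_counts, epsilon=0.5)
# Small true cells can easily go negative after perturbation, which is
# the pathology motivating the optimal mechanism studied in the paper.
```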
Comparison of Methods for Imputing Social Network Data
Ziqian Xu, Jiarui Hai, Yutong Yang, Zhiyong Zhang
Journal of Data Science, 2022. doi:10.6339/22-jds1045

Social network data often contain missing values because of the sensitive nature of the information collected and the dependency among the network actors. In response, network imputation methods have been developed, including simple ones constructed from network structural characteristics and more complicated model-based ones. Although past studies have explored the influence of missing data on social networks and the effectiveness of imputation procedures under many missing data conditions, the current study evaluates a more extensive set of eight network imputation techniques (null-tie, Reconstruction, Preferential Attachment, Constrained Random Dot Product Graph, Multiple Imputation by Bayesian Exponential Random Graph Models or BERGMs, k-Nearest Neighbors, Random Forest, and Multiple Imputation by Chained Equations) under more practical conditions through comprehensive simulation. A factorial design for missing data conditions is adopted, with factors including missing data types, missing data mechanisms, and missing data proportions, applied to social networks generated with varying numbers of actors from four different sets of ERGM coefficients. Results show that the effectiveness of imputation methods differs by missing data type, missing data mechanism, the evaluation criteria used, and the complexity of the social networks. More complex methods such as the BERGMs perform consistently well in recovering missing edges that should have been present. While simpler methods like Reconstruction work better at recovering network statistics when the missing proportion of present edges is low, the BERGMs work better when more present edges are missing. The BERGMs also work well at recovering ERGM coefficients when the networks are complex and the missing data type is actor non-response. In conclusion, researchers analyzing social networks with incomplete data should identify the network structures of interest and the potential missing data types before selecting appropriate imputation methods.
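Two of the simplest techniques in the comparison can be sketched directly on an adjacency matrix with missing entries: null-tie imputation fills every missing tie with 0, while Reconstruction fills a missing directed tie with its observed counterpart in the opposite direction when available. This is a minimal reading of the two procedures for illustration; the paper's exact implementations may differ in details.

```python
import numpy as np

def null_tie_impute(adj):
    """Replace every missing (NaN) entry with 0, i.e. assume no tie."""
    out = adj.copy()
    out[np.isnan(out)] = 0.0
    return out

def reconstruction_impute(adj):
    """Fill a missing tie i->j with the observed tie j->i when it exists;
    fall back to 0 when both directions are missing."""
    out = adj.copy()
    missing = np.isnan(out)
    out[missing] = adj.T[missing]       # borrow the symmetric counterpart
    out[np.isnan(out)] = 0.0            # both directions missing
    return out

nan = float("nan")
adj = np.array([[0.0, 1.0, nan],
                [1.0, 0.0, 0.0],
                [1.0, nan, 0.0]])       # two missing directed ties
```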