
Latest Articles in Journal of Data Science (JDS)

Editorial: Large-Scale Spatial Data Science
Pub Date : 2022-01-01 DOI: 10.6339/22-jds204edi
Sameh Abdulah, S. Castruccio, M. Genton, Ying Sun
This special issue features eight articles on “Large-Scale Spatial Data Science.” Data science for complex and large-scale spatial and spatio-temporal data has become essential in many research fields, such as climate science and environmental applications. Due to the ever-increasing amounts of data collected, traditional statistical approaches tend to break down, and computationally efficient methods and scalable algorithms suitable for large-scale spatial data have become crucial to cope with the many challenges associated with big data. This special issue aims to highlight some of the latest developments in the area of large-scale spatial data science. The research papers presented showcase advanced statistical methods and machine learning approaches for solving complex and large-scale problems arising from modern data science applications. Abdulah et al. (2022) reported the results of the second competition on spatial statistics for large datasets organized by the King Abdullah University of Science and Technology (KAUST). Very large datasets (up to 1 million in size) were generated with the ExaGeoStat software to design the competition on large-scale predictions in challenging settings, including univariate nonstationary spatial processes, univariate stationary space-time processes, and bivariate stationary spatial processes. The authors described the data generation process in detail in each setting and made these valuable datasets publicly available. They reviewed the methods used by fourteen competing teams worldwide, analyzed the results of the competition, and assessed the performance of each team.
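The scalability bottleneck described here comes from exact Gaussian-process computations on large spatial datasets. As a minimal illustration (not the ExaGeoStat implementation; the exponential covariance parameters and simulated data below are assumptions for the sketch), a simple-kriging predictor makes the O(n³) linear solve visible:

```python
import numpy as np

def exp_cov(coords_a, coords_b, variance=1.0, range_=0.3):
    """Exponential covariance C(h) = variance * exp(-||h|| / range_)."""
    d = np.linalg.norm(coords_a[:, None, :] - coords_b[None, :, :], axis=-1)
    return variance * np.exp(-d / range_)

def simple_kriging(obs_coords, obs_values, pred_coords, nugget=1e-6):
    """Zero-mean simple kriging: prediction = C_po @ C_oo^{-1} @ z."""
    c_oo = exp_cov(obs_coords, obs_coords) + nugget * np.eye(len(obs_coords))
    c_po = exp_cov(pred_coords, obs_coords)
    # The dense O(n^3) solve below is exactly what breaks down at large n.
    weights = np.linalg.solve(c_oo, obs_values)
    return c_po @ weights

rng = np.random.default_rng(0)
obs = rng.uniform(size=(50, 2))                       # 50 sites in the unit square
vals = np.sin(3 * obs[:, 0]) + 0.1 * rng.standard_normal(50)
pred = simple_kriging(obs, vals, rng.uniform(size=(5, 2)))
```

Scalable methods of the kind this issue surveys replace the dense solve with low-rank, sparse, or hierarchical approximations.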
Citations: 0
Data Science Applications and Implications in Legal Studies: A Perspective Through Topic Modelling
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1058
Jinzhe Tan, Huan Wan, Ping Yan, Zheng Hua Zhu
Law and legal studies have been an exciting new field for data science applications, while technological advancement also has profound implications for legal practice. For example, the legal industry has accumulated a rich body of high-quality texts, images, and other digitised formats, which are ready to be further processed and analysed by data scientists. On the other hand, the increasing popularity of data science has been a genuine challenge to legal practitioners, regulators, and even the general public, and has motivated a long-lasting debate in academia focusing on issues such as privacy protection and algorithmic discrimination. This paper collects 1236 journal articles involving both law and data science from the platform Web of Science to understand the patterns and trends of this interdisciplinary research field in terms of English journal publications. We find a clear trend of increasing publication volume over time and a strong presence of high-impact law and political science journals. We then use Latent Dirichlet Allocation (LDA) as a topic modelling method to classify the abstracts into four topics based on the coherence measure. The four topics identified confirm that both challenges and opportunities have been investigated in this interdisciplinary field and help offer directions for future research.
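The LDA step can be sketched with a compact collapsed Gibbs sampler (an illustrative toy, not the authors' pipeline; the two-word-family corpus, hyperparameters, and iteration count are assumptions):

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_vocab, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of word-id lists."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # document-topic counts
    nkw = np.zeros((n_topics, n_vocab))     # topic-word counts
    nk = np.zeros(n_topics)                 # topic totals
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Full conditional for the topic of word w in document d.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_vocab * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return nkw  # normalise rows to get topic-word distributions

# Two artificial "vocabularies" (word ids 0-2 vs. 3-5) that should separate.
docs = [[0, 1, 2, 0, 1]] * 8 + [[3, 4, 5, 3, 4]] * 8
topics = lda_gibbs(docs, n_topics=2, n_vocab=6)
```

In practice the number of topics (four in the paper) is chosen by comparing coherence scores across candidate models.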
Citations: 0
Creating a Census County Assessment Tool for Visualizing Census Data
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1082
Izzy Youngs, R. Prevost, Christopher Dick
The 2020 Census County Assessment Tool was developed to assist decennial census data users in identifying deviations between expected census counts and the released counts across population and housing indicators. The tool also offers contextual data for each county on factors which could have contributed to census collection issues, such as self-response rates and COVID-19 infection rates. The tool compiles this information into a downloadable report and includes additional local data sources relevant to the data collection process, as well as experts to contact for further assistance.
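A deviation check of the kind the tool performs can be sketched as follows (the county names, counts, and 5% threshold are hypothetical, not values from the tool):

```python
def flag_deviations(expected, released, threshold=0.05):
    """Flag counties whose released count deviates from the expected count
    by more than `threshold` (as a fraction of the expected count)."""
    flags = {}
    for county, exp_count in expected.items():
        rel = released.get(county)
        if rel is None:
            continue  # county missing from the release; nothing to compare
        deviation = (rel - exp_count) / exp_count
        if abs(deviation) > threshold:
            flags[county] = round(deviation, 4)
    return flags

expected = {"Adams": 10000, "Boone": 25000, "Clark": 8000}
released = {"Adams": 10350, "Boone": 24900, "Clark": 7300}
# Adams deviates +3.5% and Boone -0.4% (neither flagged); Clark deviates
# -8.75% and is flagged for review.
flags = flag_deviations(expected, released)
```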
Citations: 2
Exploring Rural Shrink Smart Through Guided Discovery Dashboards
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1080
Denise Bradford, Susan Vanderplas
Many small and rural places are shrinking. Interactive dashboards are the most common use cases for data visualization and provide context for exploratory data tools. In our paper, we use Iowa data to explore the specific scope of how dashboards are used in small and rural areas to empower novice analysts to make data-driven decisions. Our framework suggests a number of research directions to better support small and rural places facing shrinkage through interactive dashboard design, implementation, and use by the everyday analyst.
Citations: 2
Improving the Science of Annotation for Natural Language Processing: The Use of the Single-Case Study for Piloting Annotation Projects
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1054
Kylie L. Anglin, Arielle Boguslav, Todd Hall
Researchers need guidance on how to obtain maximum efficiency and accuracy when annotating training data for text classification applications. Further, given wide variability in the kinds of annotations researchers need to obtain, they would benefit from the ability to conduct low-cost experiments during the design phase of annotation projects. To this end, our study proposes the single-case study design as a feasible and causally-valid experimental design for determining the best procedures for a given annotation task. The key strength of the design is its ability to generate causal evidence at the individual level, identifying the impact of competing annotation techniques and interfaces for the specific annotator(s) included in an annotation project. In this paper, we demonstrate the application of the single-case study in an applied experiment and argue that future researchers should incorporate the design into the pilot stage of annotation projects so that, over time, a causally-valid body of knowledge regarding the best annotation techniques is built.
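The core comparison in an alternating-treatments single-case design can be sketched in a few lines (the session outcomes and the simple mean-difference effect estimate are illustrative assumptions, not the authors' analysis):

```python
from statistics import mean

def alternating_treatment_effect(sessions):
    """Mean outcome difference between two alternated conditions for a single
    annotator; `sessions` is a list of (condition, outcome) pairs."""
    a = [y for c, y in sessions if c == "A"]
    b = [y for c, y in sessions if c == "B"]
    return mean(b) - mean(a)

# Hypothetical per-session annotation accuracies for one annotator who
# alternates between interface A and interface B across pilot sessions.
sessions = [("A", 0.78), ("B", 0.85), ("A", 0.80), ("B", 0.88),
            ("A", 0.77), ("B", 0.86)]
effect = alternating_treatment_effect(sessions)
```

Because the comparison is within a single annotator, the estimate speaks to the causal effect of the interface for that specific annotator, which is the design's key strength.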
Citations: 1
Motion Picture Editing as a Hawkes Process
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1055
Nick Redfern
In this article I analyse motion picture editing as a point process to explore the temporal structure in the timings of cuts in motion pictures, modelling the editing in 134 Hollywood films released between 1935 and 2005 as a Hawkes process with an exponential kernel. The results show that the editing in Hollywood films can be modelled as a Hawkes process and that the conditional intensity function provides a direct description of the instantaneous cutting rate of a film, revealing the structure of a film’s editing at a range of scales. The parameters of the exponential kernel show a clear trend over time toward a more rapid editing style, with an increase in the rate of exogenous events and a small increase in the rate of endogenous events. This is consistent with the shift from a classical to an intensified continuity editing style. There are, however, few differences between genres, indicating the consistency of editing practices in Hollywood cinema over time and across different types of films.
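The conditional intensity has a simple closed form for the exponential kernel. A minimal sketch (the cut times and parameter values are invented for illustration, not fitted estimates from the 134 films):

```python
import math

def hawkes_intensity(t, cut_times, mu, alpha, beta):
    """Conditional intensity of a Hawkes process with exponential kernel:
    lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i)),
    where mu is the exogenous (background) cutting rate and the sum captures
    endogenous self-excitation from earlier cuts."""
    return mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in cut_times if ti < t)

# Hypothetical cut times (seconds) early in a film.
cuts = [2.0, 5.5, 6.0, 9.0]
rate = hawkes_intensity(10.0, cuts, mu=0.2, alpha=0.8, beta=1.5)
```

Each recent cut pushes the instantaneous cutting rate above the baseline mu, then the excitation decays at rate beta, which is what lets the model capture clustering of cuts at a range of scales.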
Citations: 0
Modeling Dynamic Transport Network with Matrix Factor Models: an Application to International Trade Flow
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1065
Elynn Y. Chen, Rong Chen
International trade research plays an important role in informing trade policy and shedding light on wider economic issues. With recent advances in information technology, economic agencies distribute an enormous amount of internationally comparable trading data, providing a gold mine for empirical analysis of international trade. International trading data can be viewed as a dynamic transport network because it emphasizes the amount of goods moving across network edges. Most literature on dynamic network analysis concentrates on parametric modeling of the connectivity network, focusing on link formation or deformation rather than the transport moving across the network. We take a different non-parametric perspective from the pervasive node-and-edge-level modeling: the dynamic transport network is modeled as a time series of relational matrices; variants of the matrix factor model of Wang et al. (2019) are applied to provide a specific interpretation for the dynamic transport network. Under the model, the observed surface network is assumed to be driven by a latent dynamic transport network with lower dimensions. Our method is able to unveil the latent dynamic structure and achieves the goal of dimension reduction. We applied the proposed method to a dataset of monthly trading volumes among 24 countries (and regions) from 1982 to 2015. Our findings shed light on trading hubs, centrality, trends, and patterns of international trade and show matching change points to trading policies. The dataset also provides a fertile ground for future research on international trade.
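In the spirit of the matrix factor model X_t ≈ R F_t C', a minimal estimation sketch using leading eigenvectors of aggregated second-moment matrices (a simplified stand-in for the Wang et al. (2019) estimator; the simulated 24-node data and factor dimensions are assumptions):

```python
import numpy as np

def matrix_factor_loadings(X, k1, k2):
    """Estimate row/column loading spaces of X_t = R F_t C' from the top
    eigenvectors of the aggregated second-moment matrices, then project
    each observed matrix down to a k1 x k2 factor matrix."""
    M_row = sum(Xt @ Xt.T for Xt in X)
    M_col = sum(Xt.T @ Xt for Xt in X)
    R = np.linalg.eigh(M_row)[1][:, -k1:]   # top-k1 eigenvectors (eigh sorts ascending)
    C = np.linalg.eigh(M_col)[1][:, -k2:]
    F = [R.T @ Xt @ C for Xt in X]          # low-dimensional latent factors
    return R, C, F

# Simulate 60 monthly 24 x 24 "trade" matrices driven by a rank-3 structure.
rng = np.random.default_rng(1)
R0, C0 = rng.standard_normal((24, 3)), rng.standard_normal((24, 3))
X = [R0 @ rng.standard_normal((3, 3)) @ C0.T for _ in range(60)]
R, C, F = matrix_factor_loadings(X, 3, 3)
```

The 24-dimensional surface network is thus compressed to a 3 x 3 latent transport network at each time point, which is the dimension-reduction goal described above.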
Citations: 1
Flowthrough Centrality: A Stable Node Centrality Measure
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1081
Charles F. Mann, M. McGee, E. Olinick, D. Matula
This paper introduces flowthrough centrality, a node centrality measure determined from the hierarchical maximum concurrent flow problem (HMCFP). Based upon the extent to which a node is acting as a hub within a network, this centrality measure is defined as the ratio of the flow passing through the node to the node's total flow capacity. Flowthrough centrality is compared to the commonly-used centralities of closeness centrality, betweenness centrality, and flow betweenness centrality, as well as to stable betweenness centrality, to measure the stability (i.e., accuracy) of the centralities when knowledge of the network topology is incomplete or in transition. Perturbations do not alter the flowthrough centrality values of nodes that are based upon flow as much as they do other types of centrality values that are based upon geodesics. The flowthrough centrality measure overcomes the problem of overstating or understating the roles that significant actors play in social networks. The flowthrough centrality is canonical in that it is determined from a natural, realized flow universally applicable to all networks.
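The definition (flow through a node as a fraction of its capacity) can be illustrated on a toy network once a flow decomposition is given. This sketch assumes the decomposition is already known; it does not solve the HMCFP on which the actual measure is built:

```python
def flowthrough(paths_with_flow, capacity):
    """Toy flowthrough score: total flow carried through each node as an
    interior (non-endpoint) node, divided by that node's flow capacity."""
    through = {v: 0.0 for v in capacity}
    for path, flow in paths_with_flow:
        for v in path[1:-1]:          # only interior nodes "carry" the flow
            through[v] += flow
    return {v: through[v] / capacity[v] for v in capacity}

# Hypothetical flow decomposition on a 4-node network: (path, flow amount).
flows = [(("a", "b", "d"), 2.0),
         (("a", "c", "d"), 1.0),
         (("a", "b", "c", "d"), 0.5)]
caps = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 5.0}
scores = flowthrough(flows, caps)   # b carries 2.5 of capacity 4, c carries 1.5 of 3
```

Because the score is tied to realized flow rather than geodesics, rerouting a small amount of flow changes it only proportionally, which is the intuition behind the stability claim.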
Citations: 0
Inference for Optimal Differential Privacy Procedures for Frequency Tables
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1044
Chengcheng Li, Na Wang, Gongjun Xu
When releasing data to the public, a vital concern is the risk of exposing personal information of the individuals who have contributed to the data set. Many mechanisms have been proposed to protect individual privacy, though less attention has been dedicated to practically conducting valid inferences on the altered privacy-protected data sets. For frequency tables, privacy-protection-oriented perturbations often lead to negative cell counts. Releasing such tables can undermine users’ confidence in the usefulness of such data sets. This paper focuses on releasing one-way frequency tables. We recommend an optimal mechanism that satisfies ϵ-differential privacy (DP) without producing negative cell counts. The procedure is optimal in the sense that the expected utility is maximized under a given privacy constraint. Valid inference procedures for testing goodness-of-fit are also developed for the DP privacy-protected data. In particular, we propose a de-biased test statistic for the optimal procedure and derive its asymptotic distribution. In addition, we also introduce testing procedures for the commonly used Laplace and Gaussian mechanisms, which provide a good finite sample approximation for the null distributions. Moreover, the decaying rate requirements for the privacy regime are provided for the inference procedures to be valid. We further consider common users’ practices such as merging related or neighboring cells or integrating statistical information obtained across different data sources, and derive valid testing procedures when these operations occur. Simulation studies show that our inference results hold well even when the sample size is relatively small. Comparisons with the current field standards, including the Laplace, the Gaussian (both with/without post-processing of replacing negative cell counts with zeros), and the Binomial-Beta McClure-Reiter mechanisms, are carried out. In the end, we apply our method to the National Center for Early Development and Learning’s (NCEDL) multi-state studies data to demonstrate its practical applicability.
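The negative-count problem with the standard Laplace mechanism is easy to reproduce (a textbook sketch assuming unit L1 sensitivity for disjoint histogram cells; the counts and ϵ below are arbitrary, and this is not the paper's optimal mechanism):

```python
import numpy as np

def laplace_mechanism(counts, epsilon, seed=0):
    """Standard Laplace mechanism for a frequency table: add independent
    Laplace(1/epsilon) noise to each cell. Because the noise is unbounded
    below, small cells can be pushed negative, which motivates mechanisms
    that avoid negative counts by construction."""
    rng = np.random.default_rng(seed)
    return counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))

counts = np.array([3, 0, 12, 1, 25], dtype=float)
noisy = laplace_mechanism(counts, epsilon=0.5)   # scale 2 noise on each cell
```

A common but crude fix is to clip negatives to zero after the fact; the abstract above notes that such post-processing is one of the baselines compared.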
引用次数: 0
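The negative-cell-count problem this abstract highlights is easy to reproduce with the standard Laplace mechanism. The sketch below is a generic illustration, not the authors' optimal procedure: the function name and toy counts are this sketch's own assumptions. It perturbs a one-way table and applies the naive clip-at-zero post-processing that the paper compares against.

```python
import numpy as np

def laplace_mechanism(counts, epsilon, rng):
    """Release a one-way frequency table under epsilon-differential privacy.

    Adding or removing one individual changes exactly one cell by 1, so the
    L1 sensitivity of the table is 1 and Laplace noise of scale 1/epsilon
    suffices.
    """
    counts = np.asarray(counts, dtype=float)
    return counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)

rng = np.random.default_rng(0)
true_counts = [120, 45, 8, 2]  # the small cells are the ones at risk
noisy = laplace_mechanism(true_counts, epsilon=0.1, rng=rng)

# At epsilon = 0.1 the noise scale is 10, so small cells frequently
# come out negative in the raw release.
went_negative = bool((noisy < 0).any())

# Common post-processing: clip at zero. Post-processing preserves DP,
# but the clipping biases the released counts upward, which is one
# reason the paper develops de-biased test statistics rather than
# relying on this fix.
released = np.clip(noisy, 0.0, None)
```

Clipping keeps the release valid under the same privacy budget (post-processing never weakens DP), but any downstream goodness-of-fit test must account for the bias it introduces.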
Comparison of Methods for Imputing Social Network Data
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1045
Ziqian Xu, Jiarui Hai, Yutong Yang, Zhiyong Zhang
Social network data often contain missing values because of the sensitive nature of the information collected and the dependency among the network actors. As a response, network imputation methods including simple ones constructed from network structural characteristics and more complicated model-based ones have been developed. Although past studies have explored the influence of missing data on social networks and the effectiveness of imputation procedures in many missing data conditions, the current study aims to evaluate a more extensive set of eight network imputation techniques (i.e., null-tie, Reconstruction, Preferential Attachment, Constrained Random Dot Product Graph, Multiple Imputation by Bayesian Exponential Random Graph Models or BERGMs, k-Nearest Neighbors, Random Forest, and Multiple Imputation by Chained Equations) under more practical conditions through comprehensive simulation. A factorial design for missing data conditions is adopted with factors including missing data types, missing data mechanisms, and missing data proportions, which are applied to generated social networks with varying numbers of actors based on 4 different sets of coefficients in ERGMs. Results show that the effectiveness of imputation methods differs by missing data types, missing data mechanisms, the evaluation criteria used, and the complexity of the social networks. More complex methods such as the BERGMs have consistently good performances in recovering missing edges that should have been present. While simpler methods like Reconstruction work better in recovering network statistics when the missing proportion of present edges is low, the BERGMs work better when more present edges are missing. The BERGMs also work well in recovering ERGM coefficients when the networks are complex and the missing data type is actor non-response. 
In conclusion, researchers analyzing social networks with incomplete data should identify the network structures of interest and the potential missing data types before selecting appropriate imputation methods.
Citations: 1
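The two simplest techniques in the comparison above can be sketched in a few lines. Assuming missing dyads are encoded as NaN in an adjacency matrix (an encoding chosen for this sketch, not taken from the paper), null-tie imputation treats every missing dyad as "no tie", while a Reconstruction-style rule fills a missing dyad with the observed value of its reciprocal:

```python
import numpy as np

def impute_null_tie(adj):
    """Null-tie imputation: treat every missing dyad (NaN) as 'no tie'."""
    out = adj.copy()
    out[np.isnan(out)] = 0.0
    return out

def impute_reconstruction(adj):
    """Reconstruction-style rule for a directed network: fill a missing
    dyad (i, j) with the observed value of the reciprocal dyad (j, i);
    dyads missing in both directions fall back to 'no tie'."""
    out = adj.copy()
    missing = np.isnan(out)
    out[missing] = out.T[missing]   # borrow the reciprocal tie
    out[np.isnan(out)] = 0.0        # both directions missing -> no tie
    return out

# Toy directed network with actor non-response: actor 2 reported nothing,
# so row 2 is entirely missing.
nan = np.nan
adj = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [nan, nan, nan],
])
print(impute_null_tie(adj)[2])        # [0. 0. 0.]
print(impute_reconstruction(adj)[2])  # [0. 1. 0.]
```

The toy example shows why the study finds Reconstruction adequate only at low missingness: it can recover the tie 1→2 from the observed reciprocal, but any dyad unobserved in both directions still degenerates to the null-tie answer.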
Journal of data science : JDS