
Journal of data science : JDS — Latest Publications

Improving the Science of Annotation for Natural Language Processing: The Use of the Single-Case Study for Piloting Annotation Projects
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1054
Kylie L. Anglin, Arielle Boguslav, Todd Hall
Researchers need guidance on how to obtain maximum efficiency and accuracy when annotating training data for text classification applications. Further, given wide variability in the kinds of annotations researchers need to obtain, they would benefit from the ability to conduct low-cost experiments during the design phase of annotation projects. To this end, our study proposes the single-case study design as a feasible and causally-valid experimental design for determining the best procedures for a given annotation task. The key strength of the design is its ability to generate causal evidence at the individual level, identifying the impact of competing annotation techniques and interfaces for the specific annotator(s) included in an annotation project. In this paper, we demonstrate the application of the single-case study in an applied experiment and argue that future researchers should incorporate the design into the pilot stage of annotation projects so that, over time, a causally-valid body of knowledge regarding the best annotation techniques is built.
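To make the design concrete, here is a minimal Python sketch (not the authors' code; all numbers are hypothetical) of an alternating-treatments single-case pilot: one annotator's per-session accuracy alternates between two interfaces, and an exact randomization test assesses whether the observed difference could plausibly arise by chance.

```python
# Hypothetical single-case pilot: sessions alternate between annotation
# interfaces A and B; an exact randomization test over all possible
# assignments of four sessions to B gives individual-level causal evidence.
import itertools
import numpy as np

accuracy = np.array([0.71, 0.84, 0.69, 0.88, 0.73, 0.86, 0.70, 0.85])
condition = np.array(list("ABABABAB"))  # alternating-treatments design

observed = accuracy[condition == "B"].mean() - accuracy[condition == "A"].mean()

# Exact randomization distribution: every way to assign 4 of 8 sessions to B.
diffs = []
for idx in itertools.combinations(range(len(accuracy)), 4):
    mask = np.zeros(len(accuracy), dtype=bool)
    mask[list(idx)] = True
    diffs.append(accuracy[mask].mean() - accuracy[~mask].mean())

p_value = np.mean(np.array(diffs) >= observed)
print(f"B - A = {observed:.3f}, one-sided p = {p_value:.3f}")
```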
Citations: 1
Motion Picture Editing as a Hawkes Process
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1055
Nick Redfern
In this article I analyse motion picture editing as a point process to explore the temporal structure in the timings of cuts in motion pictures, modelling the editing in 134 Hollywood films released between 1935 and 2005 as a Hawkes process with an exponential kernel. The results show that the editing in Hollywood films can be modelled as a Hawkes process and that the conditional intensity function provides a direct description of the instantaneous cutting rate of a film, revealing the structure of a film's editing at a range of scales. The parameters of the exponential kernel show a clear trend over time towards a more rapid editing style, with an increase in the rate of exogenous events and a small increase in the rate of endogenous events. This is consistent with the shift from a classical to an intensified continuity editing style. There are, however, few differences between genres, indicating the consistency of editing practices in Hollywood cinema over time and across different types of films.
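As a minimal illustration of the model (with hypothetical cut times and parameters, not the paper's estimates), the conditional intensity of an exponential-kernel Hawkes process — here interpreted as the instantaneous cutting rate — can be evaluated directly:

```python
# Conditional intensity of a Hawkes process with exponential kernel:
# lambda(t) = mu + sum over past cuts t_i < t of alpha * exp(-beta * (t - t_i)).
import numpy as np

def hawkes_intensity(t, cut_times, mu, alpha, beta):
    """Instantaneous cutting rate at time t given past cut times (seconds)."""
    past = cut_times[cut_times < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

cuts = np.array([2.0, 5.5, 6.1, 6.8, 12.0, 12.4])  # hypothetical cut times
grid = np.linspace(0, 15, 301)
lam = np.array([hawkes_intensity(t, cuts, mu=0.2, alpha=0.8, beta=1.5)
                for t in grid])
print(lam.max())  # intensity peaks where cuts cluster (self-excitation)
```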
Citations: 0
Modeling Dynamic Transport Network with Matrix Factor Models: an Application to International Trade Flow
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1065
Elynn Y. Chen, Rong Chen
International trade research plays an important role in informing trade policy and shedding light on wider economic issues. With recent advances in information technology, economic agencies distribute an enormous amount of internationally comparable trading data, providing a gold mine for empirical analysis of international trade. International trading data can be viewed as a dynamic transport network because it emphasizes the amount of goods moving across network edges. Most literature on dynamic network analysis concentrates on parametric modeling of the connectivity network, focusing on link formation or deformation rather than the transport moving across the network. We take a non-parametric perspective different from the pervasive node-and-edge-level modeling: the dynamic transport network is modeled as a time series of relational matrices, and variants of the matrix factor model of Wang et al. (2019) are applied to provide a specific interpretation for the dynamic transport network. Under the model, the observed surface network is assumed to be driven by a latent dynamic transport network with lower dimensions. Our method is able to unveil the latent dynamic structure and achieve the goal of dimension reduction. We applied the proposed method to a dataset of monthly trading volumes among 24 countries (and regions) from 1982 to 2015. Our findings shed light on trading hubs, centrality, trends, and patterns of international trade, and show change points matching trading policies. The dataset also provides a fertile ground for future research on international trade.
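The following Python sketch illustrates the general idea on simulated data; the estimator shown (top eigenvectors of averaged second-moment matrices) is a simplification for exposition, not the exact procedure of Wang et al. (2019):

```python
# Matrix factor model X_t = R F_t C' + E_t for a dynamic transport network:
# each X_t is an n x n relational (trade-flow) matrix, driven by a k x k
# latent factor matrix F_t through row and column loadings R and C.
import numpy as np

rng = np.random.default_rng(0)
T, n, k = 120, 24, 3                       # months, countries, latent rank
R = rng.normal(size=(n, k))
C = rng.normal(size=(n, k))
X = np.stack([R @ rng.normal(size=(k, k)) @ C.T + 0.1 * rng.normal(size=(n, n))
              for _ in range(T)])          # simulated trade-flow matrices

M_row = sum(Xt @ Xt.T for Xt in X) / T     # averaged row second moment
M_col = sum(Xt.T @ Xt for Xt in X) / T     # averaged column second moment
R_hat = np.linalg.eigh(M_row)[1][:, -k:]   # top-k eigenvectors as loadings
C_hat = np.linalg.eigh(M_col)[1][:, -k:]
F_hat = np.array([R_hat.T @ Xt @ C_hat for Xt in X])  # low-dim latent series
print(F_hat.shape)                         # (120, 3, 3): dimension reduced
```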
Citations: 1
Flowthrough Centrality: A Stable Node Centrality Measure
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1081
Charles F. Mann, M. McGee, E. Olinick, D. Matula
This paper introduces flowthrough centrality, a node centrality measure determined from the hierarchical maximum concurrent flow problem (HMCFP). Reflecting the extent to which a node acts as a hub within a network, this centrality measure is defined as the ratio of the flow passing through the node to the node's total flow capacity. Flowthrough centrality is compared to the commonly used closeness, betweenness, and flow betweenness centralities, as well as to stable betweenness centrality, to measure the stability (i.e., accuracy) of the centralities when knowledge of the network topology is incomplete or in transition. Because it is based upon flow, flowthrough centrality is altered less by perturbations than are centralities based upon geodesics. The flowthrough centrality measure overcomes the problem of overstating or understating the roles that significant actors play in social networks. Flowthrough centrality is canonical in that it is determined from a natural, realized flow universally applicable to all networks.
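As a toy illustration only — a single-commodity max-flow problem rather than the paper's HMCFP, with an assumed notion of node capacity (the smaller of a node's total in- and out-edge capacities) — a flow-through-to-capacity ratio can be computed as follows:

```python
# Toy flow-through-to-capacity ratio on a tiny directed network. This is a
# simplified stand-in for the paper's HMCFP-based definition.
import networkx as nx

G = nx.DiGraph()
G.add_edge("s", "a", capacity=3)
G.add_edge("s", "b", capacity=2)
G.add_edge("a", "t", capacity=2)
G.add_edge("b", "t", capacity=3)
G.add_edge("a", "b", capacity=1)

value, flow = nx.maximum_flow(G, "s", "t")  # realized flow on each edge
for v in ("a", "b"):
    inflow = sum(flow[u].get(v, 0) for u in G.predecessors(v))
    cap_in = sum(G[u][v]["capacity"] for u in G.predecessors(v))
    cap_out = sum(G[v][w]["capacity"] for w in G.successors(v))
    print(v, inflow / min(cap_in, cap_out))  # flow through v / assumed capacity
```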
Citations: 0
Comparison of Methods for Imputing Social Network Data
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1045
Ziqian Xu, Jiarui Hai, Yutong Yang, Zhiyong Zhang
Social network data often contain missing values because of the sensitive nature of the information collected and the dependency among the network actors. In response, network imputation methods, including simple ones constructed from network structural characteristics and more complicated model-based ones, have been developed. Although past studies have explored the influence of missing data on social networks and the effectiveness of imputation procedures under many missing data conditions, the current study evaluates a more extensive set of eight network imputation techniques (i.e., null-tie, Reconstruction, Preferential Attachment, Constrained Random Dot Product Graph, Multiple Imputation by Bayesian Exponential Random Graph Models or BERGMs, k-Nearest Neighbors, Random Forest, and Multiple Imputation by Chained Equations) under more practical conditions through comprehensive simulation. A factorial design for missing data conditions is adopted, with factors including missing data types, missing data mechanisms, and missing data proportions, applied to social networks generated with varying numbers of actors from four different sets of ERGM coefficients. Results show that the effectiveness of the imputation methods differs by missing data type, missing data mechanism, the evaluation criteria used, and the complexity of the social networks. More complex methods such as the BERGMs consistently perform well in recovering missing edges that should have been present. While simpler methods like Reconstruction work better in recovering network statistics when the missing proportion of present edges is low, the BERGMs work better when more present edges are missing. The BERGMs also work well in recovering ERGM coefficients when the networks are complex and the missing data type is actor non-response. In conclusion, researchers analyzing social networks with incomplete data should identify the network structures of interest and the potential missing data types before selecting appropriate imputation methods.
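As a minimal sketch of one of the simple methods compared, a Reconstruction-style imputation fills a missing tie report from the partner's observed reciprocal report; the toy adjacency matrix and the null-tie fallback here are illustrative assumptions, not the study's code:

```python
# Reconstruction-style imputation on a toy directed adjacency matrix.
# NaN marks a missing tie report; it is imputed from the reciprocal report
# x_ji, with any remaining gaps falling back to the null tie (0).
import numpy as np

A = np.array([[0, 1, np.nan],
              [1, 0, 0],
              [1, np.nan, 0]], dtype=float)

imputed = A.copy()
miss = np.isnan(imputed)
imputed[miss] = A.T[miss]          # reconstruct from the partner's report
imputed[np.isnan(imputed)] = 0     # null-tie fallback for double-missing pairs
print(imputed)
```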
Citations: 1
Computational Challenges of t and Related Copulas
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1034
Erik Hintz, M. Hofert, C. Lemieux
The present paper addresses computational and numerical challenges when working with t copulas and their more complicated extensions, the grouped t and skew t copulas. We demonstrate how the R package nvmix can be used to work with these copulas. In particular, we discuss (quasi-)random sampling and fitting. We highlight the difficulties arising from using more complicated models, such as the lack of availability of a joint density function or the lack of an analytical form of the marginal quantile functions, and give possible solutions along with future research ideas.
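The paper itself works with the R package nvmix; purely as a language-neutral illustration of the underlying construction (not nvmix code), pseudo-random t copula samples can be obtained by pushing multivariate t draws through the univariate t CDF:

```python
# Sample from a t copula: draw multivariate t, then map each margin to
# U(0,1) with the univariate t CDF. A standard construction, shown here
# in Python rather than via the R package nvmix used in the paper.
import numpy as np
from scipy import stats

def rtcopula(n, corr, df, rng):
    """n pseudo-random samples from a t copula with correlation corr, df dof."""
    L = np.linalg.cholesky(corr)
    z = rng.standard_normal((n, corr.shape[0])) @ L.T  # N(0, corr)
    w = rng.chisquare(df, size=(n, 1)) / df
    x = z / np.sqrt(w)                                 # multivariate t
    return stats.t.cdf(x, df=df)                       # copula-scale samples

rng = np.random.default_rng(1)
corr = np.array([[1.0, 0.6], [0.6, 1.0]])
u = rtcopula(10_000, corr, df=4, rng=rng)
print(np.corrcoef(u.T)[0, 1])  # positive dependence on the copula scale
```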
Citations: 1
Clustering US States by Time Series of COVID-19 New Case Counts in the Early Months with Non-Negative Matrix Factorization
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1036
Jianmin Chen, Panpan Zhang
The spreading pattern of COVID-19 in the early months of the pandemic differed substantially across US states under different quarantine measures and reopening policies. We proposed clustering the US states into distinct communities based on the daily new confirmed case counts from March 22 to July 25 via a nonnegative matrix factorization (NMF), followed by a k-means clustering procedure on the coefficients of the NMF basis. A cross-validation method was employed to select the rank of the NMF. The method clustered the 49 continental states (including the District of Columbia) into 7 groups, two of which contained a single state. To investigate the dynamics of the clustering results over time, the same method was successively applied to time periods extended in one-week increments, starting from the period of March 22 to March 28. The results suggested a change point in the clustering in the week starting on May 30, caused by the combined impact of quarantine measures and reopening policies.
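A minimal Python sketch of the pipeline, with random stand-in data and the rank fixed at 7 rather than chosen by cross-validation as in the paper:

```python
# NMF on a states-by-days count matrix, then k-means on the basis
# coefficients. The Poisson matrix below is a random stand-in for the
# actual daily new-case counts.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.poisson(lam=50, size=(49, 126)).astype(float)  # 49 states x 126 days

nmf = NMF(n_components=7, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)            # per-state coefficients on 7 basis curves
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(W)
print(np.bincount(labels))          # community sizes
```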
Citations: 0
Is the Group Structure Important in Grouped Functional Time Series?
Pub Date : 2021-11-08 DOI: 10.6339/21-jds1031
Yang Yang, H. Shang
We study the importance of group structure in grouped functional time series. Because the group structure is not unique, we investigate different disaggregation structures in grouped functional time series. We address the practical question of whether the group structure can affect forecast accuracy. Using a dynamic multivariate functional time series method, we consider jointly modeling and forecasting multiple series. Illustrated by Japanese sub-national age-specific mortality rates from 1975 to 2016, we investigate one- to 15-step-ahead point and interval forecast accuracies for the two group structures.
Citations: 2
Privacy-Preserving Inference on the Ratio of Two Gaussians Using Sums
Pub Date : 2021-10-28 DOI: 10.6339/22-jds1050
Jingang Miao, Yiming Paul Li
The ratio of two Gaussians is useful in many contexts of statistical inference. We discuss statistically valid inference of the ratio under Differential Privacy (DP). We use the delta method to derive the asymptotic distribution of the ratio estimator and use the Gaussian mechanism to provide (epsilon, delta)-DP guarantees. As with many statistics, the quantities involved in inference about a ratio can be rewritten as functions of sums, and sums are easy to work with for many reasons. In the context of DP, the sensitivity of a sum is easy to calculate. We focus on getting the correct coverage probability of 95% confidence intervals (CIs) of the DP ratio estimator. Our simulations show that the no-correction method, which ignores the DP noise, gives CIs that are too narrow to provide proper coverage for small samples. In our specific simulation scenario, the coverage of 95% CIs can fall below 10%. We propose two methods to mitigate the under-coverage issue, one based on Monte Carlo simulation and the other based on analytical correction. We show that the CIs of our methods have much better coverage with reasonable privacy budgets. In addition, our methods can handle weighted data when the weights are fixed and bounded.
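A minimal sketch of the ingredients follows; the bounds, privacy budget, and data are illustrative assumptions, and the interval accounts for DP noise only, not the paper's full procedure (which also handles sampling variability):

```python
# Release two sums via the Gaussian mechanism, estimate the ratio of means,
# and propagate the DP noise into an interval by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 1.0, 1000).clip(0, 10)  # bounded data => finite sensitivity
y = rng.normal(2.0, 0.5, 1000).clip(0, 10)

def gaussian_mech(value, sensitivity, eps, delta):
    """Classic Gaussian mechanism: sigma = sens * sqrt(2 ln(1.25/delta)) / eps."""
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return value + rng.normal(0, sigma), sigma

sx, sig_x = gaussian_mech(x.sum(), 10, eps=0.5, delta=1e-5)
sy, sig_y = gaussian_mech(y.sum(), 10, eps=0.5, delta=1e-5)
ratio = sx / sy                              # n cancels in the ratio of means

# Monte Carlo over fresh noise draws to reflect DP uncertainty in the CI.
sims = (sx + rng.normal(0, sig_x, 100_000)) / (sy + rng.normal(0, sig_y, 100_000))
lo, hi = np.percentile(sims, [2.5, 97.5])
print(f"ratio ~ {ratio:.3f}, 95% CI from DP noise alone: ({lo:.3f}, {hi:.3f})")
```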
Citations: 0
On a Weibull-Distributed Error Component of a Multiplicative Error Model Under Inverse Square Root Transformation
Pub Date : 2021-10-12 DOI: 10.11648/J.IJDSA.20210704.12
C. U. Onyemachi, S. Onyeagu, Samuel Ademola Phillips, Jamiu Adebowale Oke, Callistus Ezekwe Ugwo
We first consider the Multiplicative Error Model (MEM) introduced in financial econometrics by Engle (2002) as a general class of time series models for positive-valued random variables, which are decomposed into the product of their conditional mean and a positive-valued error term. Considering the possibility that the error component of a MEM can be Weibull-distributed, and that data transformation is a popular remedial measure to stabilize the variance of a data set prior to statistical modeling, this paper investigates the impact of the inverse square root transformation (ISRT) on the mean and variance of a Weibull-distributed error component of a MEM. The mean and variance of the Weibull distribution and those of the inverse-square-root-transformed distribution are calculated for σ = 6, 7, …, 99, 100, with the corresponding values of n for which the mean of the untransformed distribution is equal to one. The paper concludes that the inverse square root yields better results when using a MEM with a Weibull-distributed error component and when data transformation is deemed necessary to stabilize the variance of the data set.
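A quick numerical check of the moment identity underlying such calculations — for X ~ Weibull(shape k, scale λ), E[X^r] = λ^r Γ(1 + r/k), so Y = X^(−1/2) has closed-form mean and variance for k > 1 — with illustrative parameter values rather than the paper's parameterization:

```python
# Closed-form mean and variance of the inverse-square-root-transformed
# Weibull, verified against Monte Carlo. E[X^r] = lam^r * Gamma(1 + r/k).
import numpy as np
from scipy.special import gamma

k, lam = 2.0, 1.5                      # shape, scale (illustrative)
mean_y = lam**-0.5 * gamma(1 - 0.5 / k)            # E[X^(-1/2)]
var_y = lam**-1 * gamma(1 - 1 / k) - mean_y**2     # E[X^(-1)] - E[X^(-1/2)]^2

x = lam * np.random.default_rng(0).weibull(k, size=1_000_000)
y = x**-0.5
print(mean_y, y.mean())                # closed form vs Monte Carlo
print(var_y, y.var())
```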
Citations: 0