Journal of data science : JDS: Latest Articles

Scalable Predictions for Spatial Probit Linear Mixed Models Using Nearest Neighbor Gaussian Processes.
Pub Date : 2022-01-01 Epub Date: 2022-11-03 DOI: 10.6339/22-jds1073
Arkajyoti Saha, Abhirup Datta, Sudipto Banerjee

Spatial probit generalized linear mixed models (spGLMM) with a linear fixed effect and a spatial random effect, endowed with a Gaussian Process prior, are widely used for analysis of binary spatial data. However, the canonical Bayesian implementation of this hierarchical mixed model can involve protracted Markov Chain Monte Carlo sampling. Alternative approaches have been proposed that circumvent this by directly representing the marginal likelihood from spGLMM in terms of multivariate normal cumulative distribution functions (cdf). We present a direct and fast rendition of this latter approach for predictions from a spatial probit linear mixed model. We show that the covariance matrix of the cdf characterizing the marginal cdf of binary spatial data from spGLMM is amenable to approximation using Nearest Neighbor Gaussian Processes (NNGP). This facilitates a scalable prediction algorithm for spGLMM using NNGP that only involves sparse or small matrix computations and can be deployed in an embarrassingly parallel manner. We demonstrate the accuracy and scalability of the algorithm via numerous simulation experiments and an analysis of species presence-absence data.

Journal of data science : JDS, volume 20(4), pages 533-544. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10544813/pdf/
Citations: 0
Dynamic Classification of Plasmodium vivax Malaria Recurrence: An Application of Classifying Unknown Cause of Failure in Competing Risks.
Pub Date : 2022-01-01 Epub Date: 2021-12-09 DOI: 10.6339/21-jds1026
Yutong Liu, Feng-Chang Lin, Jessica T Lin, Quefeng Li

A standard competing risks set-up requires both time to event and cause of failure to be fully observable for all subjects. However, in application, the cause of failure may not always be observable, thus impeding the risk assessment. In some extreme cases, none of the causes of failure is observable. In the case of a recurrent episode of Plasmodium vivax malaria following treatment, the patient may have suffered a relapse from a previous infection or acquired a new infection from a mosquito bite. In this case, the time to relapse cannot be modeled when a competing risk, a new infection, is present. The efficacy of a treatment for preventing relapse from a previous infection may be underestimated when the true cause of infection cannot be classified. In this paper, we developed a novel method for classifying the latent cause of failure under a competing risks set-up, which uses not only time to event information but also transition likelihoods between covariates at the baseline and at the time of event occurrence. Our classifier shows superior performance under various scenarios in simulation experiments. The method was applied to Plasmodium vivax infection data to classify recurrent infections of malaria.

Journal of data science : JDS, pages 51-78. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9347664/pdf/
Citations: 1
The Python Package open-crypto: A Cryptocurrency Data Collector
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1059
Steffen Günther, C. Fieberg, Thorsten Poddig
This paper introduces the package open-crypto for free-of-charge, systematic collection of cryptocurrency data. The package supports several methods to request (1) static data, (2) real-time data and (3) historical data. It can retrieve data from over 100 of the most popular and liquid exchanges worldwide. New exchanges can easily be added with the help of provided templates or updated with built-in functions from the project repository. The package is available on GitHub and the Python Package Index (PyPI). The data is stored in a relational SQL database and is therefore accessible from many different programming languages. We provide a hands-on guide and illustrations for each data type and explanations of the received data, and we also demonstrate usability from R and Matlab. Although academic research relies heavily on costly or confidential data, open data projects are becoming increasingly important. This project is mainly motivated by the aim of contributing openly accessible software and free data in the cryptocurrency markets to improve transparency and reproducibility in research and other disciplines.
Citations: 0
Multiresolution Broad Area Search: Monitoring Spatial Characteristics of Gapless Remote Sensing Data
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1072
Laura J. Wendelberger, J. Gray, Alyson G. Wilson, R. Houborg, B. Reich
Global earth monitoring aims to identify and characterize land cover change like construction as it occurs. Remote sensing makes it possible to collect large amounts of data in near real-time over vast geographic areas and is becoming available in increasingly fine temporal and spatial resolution. Many methods have been developed for data from a single pixel, but monitoring pixel-wise spectral measurements over time neglects spatial relationships, which become more important as change manifests in a greater number of pixels in higher resolution imagery compared to moderate resolution. Building on our previous robust online Bayesian monitoring (roboBayes) algorithm, we propose monitoring multiresolution signals based on a wavelet decomposition to capture spatial change coherence on several scales to detect change sites. Monitoring only a subset of relevant signals reduces the computational burden. The decomposition relies on gapless data; we use 3 m Planet Fusion Monitoring data. Simulations demonstrate the superiority of the spatial signals in multiresolution roboBayes (MR roboBayes) for detecting subtle changes compared to pixel-wise roboBayes. We use MR roboBayes to detect construction changes in two regions with distinct land cover and seasonal characteristics: Jacksonville, FL (USA) and Dubai (UAE). It achieves site detection with less than two thirds of the monitoring processes required for pixel-wise roboBayes at the same resolution.
Citations: 1
Subpopulation Treatment Effect Pattern Plot (STEPP) Methods with R and Stata
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1060
S. Venturini, M. Bonetti, A. Lazar, B. Cole, Xin Victoria Wang, R. Gelber, Wai-Ki Yip
We introduce the stepp packages for R and Stata that implement the subpopulation treatment effect pattern plot (STEPP) method. STEPP is a nonparametric graphical tool aimed at examining possible heterogeneous treatment effects in subpopulations defined on a continuous covariate or composite score. More specifically, STEPP considers overlapping subpopulations defined with respect to a continuous covariate (or risk index) and it estimates a treatment effect for each subpopulation. It also produces confidence regions and tests for treatment effect heterogeneity among the subpopulations. The original method has been extended in different directions such as different survival contexts, outcome types, or more efficient procedures for identifying the overlapping subpopulations. In this paper, we also introduce a novel method to determine the number of subjects within the subpopulations by minimizing the variability of the sizes of the subpopulations generated by a specific parameter combination. We illustrate the packages using both synthetic data and publicly available data sets. The most intensive computations in R are implemented in Fortran, while the Stata version exploits the powerful Mata language.
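A rough sketch of the overlapping-subpopulation idea in this abstract, assuming a simple sliding window over subjects sorted by the covariate and a difference in mean outcomes as the treatment effect. The function name and the `window_size` and `overlap` parameters are hypothetical and are not taken from the stepp packages, which also handle ties, survival outcomes, confidence regions and heterogeneity tests.

```python
import numpy as np


def sliding_window_effects(covariate, treated, outcome, window_size, overlap):
    """Estimate a treatment effect in overlapping covariate-ordered windows.

    Returns a list of (covariate_min, covariate_max, effect) tuples, where
    effect is mean(outcome | treated) - mean(outcome | control) in the window.
    """
    covariate = np.asarray(covariate, dtype=float)
    treated = np.asarray(treated).astype(bool)
    outcome = np.asarray(outcome, dtype=float)
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")

    order = np.argsort(covariate, kind="stable")  # subjects sorted by covariate
    step = window_size - overlap                  # shift between consecutive windows
    n = len(order)
    results = []
    start = 0
    while True:
        stop = min(start + window_size, n)        # the last window may be smaller
        idx = order[start:stop]
        in_trt = treated[idx]
        if in_trt.any() and (~in_trt).any():
            effect = outcome[idx][in_trt].mean() - outcome[idx][~in_trt].mean()
        else:
            effect = float("nan")                 # one arm is empty in this window
        results.append((covariate[idx].min(), covariate[idx].max(), effect))
        if stop == n:
            break
        start += step
    return results
```

Plotting the returned effects against the window midpoints gives a crude version of the pattern plot. A larger `overlap` smooths the curve at the cost of stronger dependence between neighbouring estimates.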
Citations: 0
The Impact of COVID-19 on Subjective Well-Being: Evidence from Twitter Data
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1066
Tiziana Carpi, Airo Hino, S. Iacus, G. Porro
This study analyzes the impact of the COVID-19 pandemic on subjective well-being as measured through Twitter for Japan and Italy. In the first nine months of 2020, the Twitter indicators dropped by 11.7% for Italy and 8.3% for Japan compared to the last two months of 2019, and even more compared to their historical means. To understand what affected the Twitter mood so strongly, the study considers a pool of potential factors including: climate and air quality data, number of COVID-19 cases and deaths, Facebook COVID-19 and flu-like symptoms global survey data, coronavirus-related Google search data, policy intervention measures, human mobility data, macroeconomic variables, as well as health and stress proxy variables. This study proposes a framework to analyse and assess the relative impact of these external factors on the dynamics of Twitter mood and further implements a structural model to describe the underlying concept of subjective well-being. It turns out that prolonged mobility restrictions, flu and COVID-like symptoms, economic uncertainty and low-quality social interactions have a negative impact on well-being.
Citations: 2
Sampling-based Gaussian Mixture Regression for Big Data
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1057
Joochul Lee, E. Schifano, Haiying Wang
This paper proposes a nonuniform subsampling method for finite mixtures of regression models to reduce large data computational tasks. A general estimator based on a subsample is investigated, and its asymptotic normality is established. We assign optimal subsampling probabilities to data points that minimize the asymptotic mean squared errors of the general estimator and linearly transformed estimators. Since the proposed probabilities depend on unknown parameters, an implementable algorithm is developed. We first approximate the optimal subsampling probabilities using a pilot sample. After that, we select a subsample using the approximated subsampling probabilities and compute estimates using the subsample. We evaluate the proposed method in a simulation study and present a real data example using appliance energy data.
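The pilot-then-subsample procedure in this abstract can be sketched as follows. This is only an illustration under strong assumptions: it uses ordinary least squares rather than a finite mixture of regressions, and the sampling probabilities (proportional to the absolute pilot residual times the row norm, mixed with a uniform component for stability) are an illustrative choice, not the optimal probabilities derived in the paper. All function and parameter names are hypothetical.

```python
import numpy as np


def two_step_subsample_ols(X, y, pilot_size, sub_size, rng, uniform_mix=0.2):
    """Two-step nonuniform subsampling for linear regression (illustrative).

    Step 1: fit a pilot estimate on a uniform subsample.
    Step 2: build nonuniform sampling probabilities from the pilot fit.
    Step 3: draw the subsample and fit inverse-probability-weighted least squares.
    """
    n = X.shape[0]

    # Step 1: uniform pilot sample and pilot estimate.
    pilot_idx = rng.choice(n, size=pilot_size, replace=True)
    beta_pilot, *_ = np.linalg.lstsq(X[pilot_idx], y[pilot_idx], rcond=None)

    # Step 2: scores from pilot residuals; mixing with the uniform
    # distribution keeps every probability bounded away from zero.
    scores = np.abs(y - X @ beta_pilot) * np.linalg.norm(X, axis=1) + 1e-12
    probs = (1.0 - uniform_mix) * scores / scores.sum() + uniform_mix / n

    # Step 3: sample with replacement and weight each draw by 1 / probability.
    idx = rng.choice(n, size=sub_size, replace=True, p=probs)
    sqrt_w = np.sqrt(1.0 / probs[idx])
    beta_sub, *_ = np.linalg.lstsq(X[idx] * sqrt_w[:, None], y[idx] * sqrt_w, rcond=None)
    return beta_sub, probs
```

The inverse-probability weights compensate for the nonuniform draw, so the weighted fit targets the same parameter as a full-data fit while touching only `pilot_size + sub_size` rows in the estimation steps.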
Citations: 0
What Kind of Music Do You Like? A Statistical Analysis of Music Genre Popularity Over Time
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1040
Aimée M. Petitbon, D. B. Hitchcock
Citations: 2
A Joint Analysis for Field Goal Attempts and Percentages of Professional Basketball Players: Bayesian Nonparametric Resource
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1062
Eliot Wong-Toi, Hou‐Cheng Yang, Weining Shen, Guanyu Hu
Understanding shooting patterns among different players is a fundamental problem in basketball game analyses. In this paper, we quantify the shooting pattern via the field goal attempts and percentages over twelve non-overlapping regions around the front court. A joint Bayesian nonparametric mixture model is developed to find latent clusters of players based on their shooting patterns. We apply our proposed model to learn the heterogeneity among selected players from the National Basketball Association (NBA) games over the 2018–2019 regular season and 2019–2020 bubble season. Thirteen clusters are identified for the 2018–2019 regular season and seven clusters are identified for the 2019–2020 bubble season. We further examine the shooting patterns of players in these clusters and discuss their relation to players’ other available information. The results shed new light on the effect of the NBA COVID bubble and may provide useful guidance for players’ shot selection and teams’ in-game and recruiting strategy planning.
Citations: 0
Hierarchical Ridge Regression for Incorporating Prior Information in Genomic Studies.
Pub Date : 2022-01-01 Epub Date: 2021-12-13 DOI: 10.6339/21-jds1030
Eric S Kawaguchi, Sisi Li, Garrett M Weaver, Juan Pablo Lewinger

There is a great deal of prior knowledge about gene function and regulation in the form of annotations or prior results that, if directly integrated into individual prognostic or diagnostic studies, could improve predictive performance. For example, in a study to develop a predictive model for cancer survival based on gene expression, effect sizes from previous studies or the grouping of genes based on pathways constitute such prior knowledge. However, this external information is typically only used post-analysis to aid in the interpretation of any findings. We propose a new hierarchical two-level ridge regression model that can integrate external information in the form of "meta features" to predict an outcome. We show that the model can be fit efficiently using cyclic coordinate descent by recasting the problem as a single-level regression model. In a simulation-based evaluation we show that the proposed method outperforms standard ridge regression and competing methods that integrate prior information, in terms of prediction performance when the meta features are informative on the mean of the features, and that there is no loss in performance when the meta features are uninformative. We demonstrate our approach with applications to the prediction of chronological age based on methylation features and breast cancer mortality based on gene expression features.
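The recasting of a two-level ridge model as a single-level regression can be illustrated with a small sketch. The abstract does not state the objective, so the form used here is an assumption: minimise (1/2)||y - X beta||^2 + (lam1/2)||beta - Z alpha||^2 + (lam2/2)||alpha||^2, where Z holds the meta features, with no intercept. Substituting gamma = beta - Z alpha turns this into an ordinary ridge problem on the augmented design [X, XZ] with separate penalties for gamma and alpha. The sketch solves it in closed form rather than by the cyclic coordinate descent mentioned in the abstract; the function names are hypothetical.

```python
import numpy as np


def fit_two_level_ridge(X, y, Z, lam1, lam2):
    """Fit the assumed two-level ridge via the single-level reparametrisation.

    With gamma = beta - Z @ alpha the objective becomes
    0.5*||y - X@gamma - (X@Z)@alpha||^2 + 0.5*lam1*||gamma||^2 + 0.5*lam2*||alpha||^2.
    """
    p, q = Z.shape
    W = np.hstack([X, X @ Z])                                # augmented design [X, XZ]
    D = np.diag(np.r_[np.full(p, lam1), np.full(q, lam2)])   # block-wise ridge penalties
    theta = np.linalg.solve(W.T @ W + D, W.T @ y)
    gamma, alpha = theta[:p], theta[p:]
    beta = gamma + Z @ alpha                                 # map back to original coefficients
    return beta, alpha


def fit_two_level_ridge_direct(X, y, Z, lam1, lam2):
    """Solve the joint stationarity equations in (beta, alpha) for comparison."""
    p, q = Z.shape
    A = np.block([
        [X.T @ X + lam1 * np.eye(p), -lam1 * Z],
        [-lam1 * Z.T, lam1 * Z.T @ Z + lam2 * np.eye(q)],
    ])
    b = np.r_[X.T @ y, np.zeros(q)]
    sol = np.linalg.solve(A, b)
    return sol[:p], sol[p:]
```

Both functions minimise the same strictly convex objective when `lam1` and `lam2` are positive, so they should agree up to numerical error. The reparametrised form is the one that standard single-level ridge solvers can handle directly.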

Journal of data science : JDS, vol. 20, no. 1, pp. 34-50. Publication date: 2022-01-01. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9581069/pdf/
Citations: 0