J. Roca-Pardiñas, M. Rodríguez-Álvarez, S. Sperlich
A package is introduced that provides the weighted smooth backfitting estimator for a large family of popular semiparametric regression models. This family is known as generalized structured models and comprises, for example, generalized varying coefficient models and generalized additive models, as well as mixtures of these, potentially including parametric parts. Kernel-based weighted smooth backfitting is among the statistically most efficient procedures for this model class, and its asymptotic properties are well understood thanks to the large body of literature on this estimator. The introduced weights allow for the inclusion of sampling weights, trimming, and efficient estimation under heteroscedasticity. Further options facilitate easy handling of aggregated data, prediction, and the presentation of estimation results. Cross-validation methods are provided that can be used for model and bandwidth selection.
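As a brief sketch of how such a fit might be specified (the fitting function sback() and the smooth-term constructor sb() with a bandwidth argument h are recalled from the package and should be treated as assumptions, not a verified API), a weighted additive fit could look like:

# Sketch: weighted smooth backfitting for an additive model (assumed API)
library(wsbackfit)
set.seed(1)
n  <- 500
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) + x2^2 + rnorm(n, sd = 0.3)
w  <- runif(n, 0.5, 1.5)                       # sampling weights
dat <- data.frame(y = y, x1 = x1, x2 = x2)
# Assumed call pattern: smooth terms via sb(), weights passed to the fitter
fit <- sback(y ~ sb(x1, h = 0.1) + sb(x2, h = 0.1), data = dat, weights = w)
summary(fit)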
{"title":"Package wsbackfit for Smooth Backfitting Estimation of Generalized Structured Models","authors":"J. Roca-Pardiñas, M. Rodríguez-Álvarez, S. Sperlich","doi":"10.32614/rj-2021-042","DOIUrl":"https://doi.org/10.32614/rj-2021-042","url":null,"abstract":"A package is introduced that provides the weighted smooth backfitting estimator for a large family of popular semiparametric regression models. This family is known as generalized structured models, comprising, for example, generalized varying coefficient model, generalized additive models, mixtures, potentially including parametric parts. The kernel based weighted smooth backfitting belongs to the statistically most efficient procedures for this model class. Its asymptotic properties are well understood thanks to the large body of literature about this estimator. The introduced weights allow for the inclusion of sampling weights, trimming, and efficient estimation under heteroscedasticity. Further options facilitate an easy handling of aggregated data, prediction, and the presentation of estimation results. Cross-validation methods are provided which can be used for model and bandwidth selection.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"13 1","pages":"330"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78498328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Regular expressions are powerful tools for extracting tables from non-tabular text data. Capturing regular expressions that describe the information to extract from column names can be especially useful when reshaping a data table from wide (few rows with many regularly named columns) to tall (fewer columns with more rows). We present the R package nc (short for named capture), which provides functions for wide-to-tall data reshaping using regular expressions. We describe the main new ideas of nc, and provide detailed comparisons with related R packages (stats, utils, data.table, tidyr, tidyfast, tidyfst, reshape2, cdata).
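As a hedged sketch of this interface (the function capture_melt_single() and its argument conventions are recalled from the package and should be treated as assumptions), reshaping the iris measurement columns, whose names encode a flower part and a dimension separated by a dot, might look like:

# Sketch: wide-to-tall reshape driven by a capturing regex (assumed API)
library(nc)
# Named arguments define capture groups that become output columns; unnamed
# pieces (the literal dot) are matched but not captured.
tall <- capture_melt_single(
  iris,
  part = ".*", "[.]", dim = ".*",
  value.name = "cm")
head(tall)   # expected columns: Species, part, dim, cm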
{"title":"Wide-to-tall Data Reshaping Using Regular Expressions and the nc Package","authors":"T. Hocking","doi":"10.32614/rj-2021-029","DOIUrl":"https://doi.org/10.32614/rj-2021-029","url":null,"abstract":"Regular expressions are powerful tools for extracting tables from non-tabular text data. Capturing regular expressions that describe the information to extract from column names can be especially useful when reshaping a data table from wide (few rows with many regularly named columns) to tall (fewer columns with more rows). We present the R package nc (short for named capture), which provides functions for wide-to-tall data reshaping using regular expressions. We describe the main new ideas of nc, and provide detailed comparisons with related R packages (stats, utils, data.table, tidyr, tidyfast, tidyfst, reshape2, cdata).","PeriodicalId":20974,"journal":{"name":"R J.","volume":"66 1","pages":"69"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73837511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sébastien Wouters, A. Silva, F. Boulvain, X. Devleeschouwer
The StratigrapheR package proposes new concepts for the generation of lithological logs, or lithologs, in R. Generating lithologs in a scripting environment opens new opportunities for the processing and analysis of stratified geological data. Among the new concepts presented are new plotting and data processing methodologies, new general R functions, and computer-oriented data conventions. The package structure allows these new concepts to be further improved, which can be done independently by any R user. The current limitations of the package are highlighted, along with the limitations of R for geological data processing, to help identify the best paths for improvement.

Introduction. StratigrapheR is a package implemented in the open-source programming environment R. StratigrapheR endeavors to explore new concepts to process stratified geological data. These concepts are provided to answer a major difficulty posed by such data: a large amount of field observations of varied nature, sometimes localized and small-scale, can carry information on large-scale processes. Visualizing all the relevant observations at once is therefore difficult. The usual answer to this problem in successions of stratified rocks is to report observations in a schematic form: the lithological log, or litholog (e.g., Fig. 1). The litholog is an essential tool in sedimentology and stratigraphy and proves to be equally invaluable in other fields such as volcanology, igneous petrology, or paleontology. Ideally, any data contained in a litholog should be available in a reproducible form. Therefore, the challenge at hand is what we would call "from art to useful data": how can we best extract and/or process the information contained in a litholog, designed to be as visually informative as possible (see again Fig. 1)?

Figure 1: example litholog, with bed numbers (28 to 61, including a HIATUS) and a legend of symbols: lamellar stromatoporoids, branching stromatoporoids, lamellar tabulate corals, branching tabulate corals, brachiopods, crinoids, receptaculitids, small fenestrae, large fenestrae.
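As a minimal, hedged sketch of the litholog-as-data idea (the function litholog() and its bed-boundary arguments l, r, h, i are recalled from the package and should be read as assumptions), the polygon coordinates of a four-bed log might be generated as follows:

# Sketch: a minimal litholog from bed boundaries (assumed API)
library(StratigrapheR)
l <- c(0, 1, 2, 3)                 # lower boundary of each bed (m)
r <- c(1, 2, 3, 4)                 # upper boundary of each bed (m)
h <- c(4, 3, 5, 3)                 # bed hardness, controlling the width of the log
i <- c("B1", "B2", "B3", "B4")     # bed identifiers
basic.log <- litholog(l, r, h, i)  # data frame of polygon coordinates, one polygon per bed
head(basic.log)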
{"title":"StratigrapheR: Concepts for Litholog Generation in R","authors":"Sébastien Wouters, A. Silva, F. Boulvain, X. Devleeschouwer","doi":"10.32614/rj-2021-039","DOIUrl":"https://doi.org/10.32614/rj-2021-039","url":null,"abstract":"The StratigrapheR package proposes new concepts for the generation of lithological logs, or lithologs, in R. The generation of lithologs in a scripting environment opens new opportunities for the processing and analysis of stratified geological data. Among the new concepts presented: new plotting and data processing methodologies, new general R functions, and computer-oriented data conventions are provided. The package structure allows for these new concepts to be further improved, which can be done independently by any R user. The current limitations of the package are highlighted, along with the limitations in R for geological data processing, to help identify the best paths for improvements. Introduction StratigrapheR is a package implemented in the open-source programming environment R. StratigrapheR endeavors to explore new concepts to process stratified geological data. These concepts are provided to answer a major difficulty posed by such data; namely a large amount of field observations of varied nature, sometimes localized and small-scale, can carry information on large-scale processes. Visualizing the relevant observations all at once is therefore difficult. The usual answer to this problem in successions of stratified rocks is to report observations in a schematic form: the lithological log, or litholog (e.g., Fig. 1). The litholog is an essential tool in sedimentology and stratigraphy and proves to be equally invaluable in other fields such as volcanology, igneous petrology, or paleontology. Ideally, any data contained in a litholog should be available in a reproducible form. Therefore, the challenge at hand is what we would call \"from art to useful data\"; how can we best extract and/or process the information contained in a litholog, designed to be as visually informative as possible (see again Fig. 1). 28 29 30 31 32 33 34 44 45a 45b 45c 46 47 48 49 51 35 52a 52b 60a 60b 60c 61 HIATUS lamellar stromatoporoids branching stromatoporoids lamellar tabulate corals branching tabulate corals brachiopods crinoids receptaculitids small fenestrae large fenestrae","PeriodicalId":20974,"journal":{"name":"R J.","volume":"29 1","pages":"70"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90626911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The conveyance of clinical trial explorations and analysis results from a statistician to a clinical investigator is a critical component of the drug development and clinical research cycle. Automating the process of generating documents for data descriptions, summaries, exploration, and analysis allows statisticians to provide a more comprehensive view of the information captured by a clinical trial, and efficient generation of these documents allows the statistician to focus more on the conceptual development of a trial or trial analysis and less on the implementation of the summaries and results on which decisions are made. This paper explores the use of the listdown package for automating reproducible documents in clinical trials that facilitate the collaboration between statisticians and clinicians, as well as defining an analysis pipeline for document generation.

Background and Introduction. The conveyance of clinical trial explorations and analysis results from a statistician to a clinical investigator is an often overlooked but critical component of the drug development and clinical research cycle. Graphs, tables, and other analysis artifacts are at the nexus of these collaborations. They facilitate identifying problems and bugs in the data preparation and processing stage, they help to build an intuitive understanding of mechanisms of disease and their treatment, they elucidate prognostic and predictive relationships, they provide insight that results in new hypotheses, and they convince researchers of analyses testing hypotheses. Despite their importance, the process of generating these artifacts is usually ad hoc. This is partially because of the nuance and diversity of the hypotheses and scientific questions being interrogated and, to a lesser degree, the variation in clinical data formatting. The usual process has a statistician providing a standard set of artifacts, receiving feedback, and providing updates based on that feedback. Work performed for one trial is rarely leveraged on others, and as a result a large amount of work needs to be reproduced for each trial.

There are two glaring problems with this approach. First, each analysis of a trial requires a substantial amount of error-prone work. While the variation between trials means some work needs to be done for preparation, exploration, and analysis, there are many aspects of these trials that could be better automated, resulting in greater efficiency and accuracy. Second, because this work is challenging, it often occupies the majority of the statistician's effort. Less time is spent on trial design and analysis, and this portion is taken up by a clinician who often has less expertise with the statistical aspects of the trial. As a result, the extra effort spent on processing data undermines the statistician's role as a collaborator and relegates them to that of a service provider. Tools are needed that leverage existing work to provide holistic views of trials more efficiently; this would reduce the workload and make trial design and analysis more accurate and comprehensive.
The richness of the package ecosystem of R (R Core Team, 2012), and in particular its emphasis on analysis, visualization, reproducibility, and dissemination, makes the goal of creating such tools for clinical trials feasible. Table generation is supported by packages including tableone (Yoshida and Bartel, 2020), gt (Iannone et al., 2020), and gtsummary (Sjoberg et al., 2020). Visualization is provided by packages including ggplot2 (Wickham, 2016) and survminer (Kassambara et al., 2020). Interactive presentations of the data can even be offered using DT (Xie et al., 2020), plotly (Sievert, 2020), and trelliscopejs (Hafen and Schloerke, 2020). It should also be recognized that work building on these tools for clinical trial data is already underway. The greport package (Harrell Jr, 2020) provides graphical summaries of clinical trials and is used with R Markdown (Allaire et al., 2020) to generate specific trial report types with a specific format.

The recently released listdown package (Kane et al., 2020) automates the process of generating reproducible (R Markdown) documents. Objects derived from summaries, explorations, or analyses are stored hierarchically in an R list, which defines the structure of the document. These objects are referred to as computational components because they are derived from computation, as opposed to the prose that makes up the narrative components of a document. The computational components capture and structure the objects to be presented; a listdown object is then created to describe how those objects are presented and how the document is rendered. Separating how computational components are created from how they are shown to the user provides two advantages. First, it decouples data processing and analysis from data exploration and visualization. For computationally intensive analyses this separation is essential to avoid redundant computation when only small changes are made to the presentation, and it discourages placing computationally intensive code inside R Markdown documents. Second, it provides the flexibility to quickly change how computational components are visualized or summarized, or even how the document is rendered. This makes switching from an interactive .html document to a static .pdf document considerably easier than replacing functions and parameters throughout an R Markdown document.

The package has been found to be particularly useful for reporting and research on clinical trial data. In particular, it has been used to serve collaborations focused on analyzing data from past trials in order to formulate new ones, and for trial monitoring, where trial telemetry (enrollment, response, and so on) is reported and preliminary analyses are communicated to clinicians. The associated presentations require little background, since clinicians generally have a good understanding of the data being collected, so substantial narrative components from the statistician are not needed. At the same time, the large number of hierarchical, heterogeneous artifacts (tables and plots of several types) can be automated in situations where creating R Markdown documents by hand would be inconvenient and inefficient.

The rest of this paper describes the concepts implemented in the listdown package for automated, reproducible document generation and demonstrates its use with a simplified, synthetic clinical trial data set whose variables are typical of a non-small-cell lung cancer trial. The data set comes from the forceps package (Kane, 2020). At the time of writing, that package is still under development and is not available on CRAN, but it can be installed from its development repository. The following subsections use the trial data to build a pipeline for document generation. We note that both the data and the pipeline are simple compared with most analyses of this kind; however, they suffice to illustrate the relevant concepts, and both the analysis and the concepts translate readily to real applications. The final section discusses the use of the package and its current directions.

The process of analyzing data can be described using the classic waterfall model (Benington, 1983), in which an output (an analysis presentation or service) depends on a sequence of tasks that precede it. This dependency structure implies that if a problem is detected at a given stage of the analysis production, all downstream parts must be rerun to reflect the change. A graphical depiction of the waterfall model, specific to data analysis (clinical or otherwise), is shown in Figure 1 (Figure 1: the data analysis waterfall). Note that data exploration and visualization are integral to all stages of production and are often the means by which problems are identified and analyses are refined. As described above, we implement a simple analysis pipeline. The data acquisition and preprocessing steps are handled by importing data sets from the forceps package and using some of the functions implemented there to create a single trial data set, which de-emphasizes these components of the pipeline. Although these steps are critical, the focus of this paper is on incorporating the listdown package into the later stages.

Data acquisition and preprocessing. Data acquisition refers to the part of the analysis pipeline in which data are retrieved from some managed data store for integration into the pipeline. These data sets may be retrieved as tables from databases, case report forms, Analysis Data Model (ADaM) data formatted according to the Clinical Data Interchange Standards Consortium (CDISC) standards (CDI, 2020), electronic health records, or other clinical real-world data (RWD) formats. The data are then transformed into a format suitable for analysis. In our simple example this is done by loading the data corresponding to trial outcomes, patient adverse events, patient biomarkers, and patient demographics, and transforming them into a single data set, with one row per patient and one variable per column, using the forceps and dplyr (Wickham et al., 2020) packages. The data also include longitudinal adverse event information, which is stored as a nested data frame in the ae_long column of the resulting data set.

library(forceps)
library(dplyr)
data(lc_adsl, lc_adverse_events, lc_biomarkers, lc_demography)
# The call that assembles the single per-patient lc_trial data set is truncated
# in the source; the surviving fragment nests the longitudinal adverse events
# via cohort(on = "usubjid", name = "ae_long") and passes
# biomarkers = lc_biomarkers.
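To indicate where the pipeline is heading, here is a hedged sketch of presenting computational components with listdown. The helpers listdown() and ld_make_chunks(), their argument names, and the lc_trial columns used below are recalled or invented for illustration and should be treated as assumptions rather than the paper's actual code:

# Sketch: turning a list of computational components into an R Markdown report (assumed API)
library(listdown)
library(ggplot2)
# Computational components: a hierarchical list of objects to present
cc <- list(
  Demographics       = list(`Age summary` = summary(lc_trial$age)),        # hypothetical column
  `Adverse events`   = ggplot(lc_trial, aes(x = num_ae)) + geom_histogram() # hypothetical column
)
saveRDS(cc, "cc.rds")
# Describe how the components are loaded and which package renders the plots
ld <- listdown(load_cc_path = "cc.rds", package = "ggplot2")
writeLines(ld_make_chunks(ld), "trial-overview.Rmd")  # knit this file to produce the report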
{"title":"Automating Reproducible, Collaborative Clinical Trial Document Generation with the listdown Package","authors":"M. Kane, Xun Jiang, Simon Urbanek","doi":"10.32614/rj-2021-051","DOIUrl":"https://doi.org/10.32614/rj-2021-051","url":null,"abstract":"The conveyance of clinical trial explorations and analysis results from a statistician to a clinical investigator is a critical component to the drug development and clinical research cycle. Automating the process of generating documents for data descriptions, summaries, exploration, and analysis allows statistician to provide a more comprehensive view of the information captured by a clinical trial and efficient generation of these documents allows the statistican to focus more on the conceptual development of a trial or trial analysis and less on the implementation of the summaries and results on which decisions are made. This paper explores the use of the listdown package for automating reproducible documents in clinical trials that facilitate the collaboration between statisticians and clinicians as well as defining an analysis pipeline for document generation. Background and Introduction The conveyance of clinical trial explorations and analysis results from a statistician to a clinical investigator is an often overlooked but critical component to the drug development and clinical research cycle. Graphs, tables, and other analysis artifacts are at the nexus of these collaborations. They facilitate identifying problems and bugs in the data preparation and processing stage, they help to build an intuitive understanding of mechanisms of disease and their treatment, they elucidate prognostic and predictive relationships, they provide insight that results in new hypotheses, and they convince researchers of analyses testing hypotheses. Despite their importance, the process of generating these artifacts is usually done in an ad-hoc manner. This is partially because of the nuance and diversity of the hypotheses and scientific questions being interrogated and, to a lesser degree, the variation in clinical data formatting. The usual process usually has a statistician providing a standard set of artifacts, receiving feedback, and providing an updates based on feedback. Work performed for one trial is rarely leveraged on others and as a result, a large amount of work needs to be reproduced for each trial. There are two glaring problems with this approach. First, each analysis of a trial requires a substantial amount of error-prone work. While the variation between trials means some work needs to be done for preparation, exploration, and analysis, there are many aspects of these trials that could be better automated resulting in greater efficiency and accuracy. Second, because this work is challenging, it often occupies the majority of the statisticians effort. Less time is spent on trial design and analysis and the this portion is taken up by a clinician who often has less expertise with the statistical aspects of the trial. As a result, the extra effort spent on processing data undermines statisticians role as a collaborator and relegates them to service provider. 
Need tools leveraging existing work to more efficiently provide holistic views on trials ","PeriodicalId":20974,"journal":{"name":"R J.","volume":"1 1","pages":"556"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90848279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The need to analyze the dependence between two or more point processes in time appears in many modeling problems related to the occurrence of events, such as the occurrence of climate events at different spatial locations or synchrony detection in spike train analysis. The package IndTestPP provides a general framework for all the steps in this type of analysis, and one of its main features is the implementation of three families of tests to study independence given the intensities of the processes; these tests are useful not only to assess independence but also to identify factors causing dependence. The package also includes functions for generating different types of dependent point processes and implements computational statistical inference tools using them. An application of the package to characterizing the dependence between occurrences of extreme heat events at three Spanish locations is shown.
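As a package-agnostic illustration of the kind of data these tests address (this sketch uses base R only and does not call the IndTestPP API), two homogeneous Poisson processes on an observation window can be simulated and a naive co-occurrence diagnostic computed:

# Concept sketch (base R, not the IndTestPP API): two point processes in time
set.seed(42)
T.end <- 100                                          # observation window [0, T.end]
t1 <- sort(runif(rpois(1, 0.3 * T.end), 0, T.end))    # event times of process 1 (intensity 0.3)
t2 <- sort(runif(rpois(1, 0.2 * T.end), 0, T.end))    # event times of process 2 (intensity 0.2)
# Naive diagnostic: proportion of process-1 events with a process-2 event within
# one time unit; under independence this is governed by the intensity of process 2.
mean(sapply(t1, function(t) any(abs(t2 - t) <= 1)))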
{"title":"Analyzing Dependence between Point Processes in Time Using IndTestPP","authors":"A. Cebrián, J. Asín","doi":"10.32614/rj-2021-049","DOIUrl":"https://doi.org/10.32614/rj-2021-049","url":null,"abstract":"The need to analyze the dependence between two or more point processes in time appears in many modeling problems related to the occurrence of events, such as the occurrence of climate events at different spatial locations or synchrony detection in spike train analysis. The package IndTestPP provides a general framework for all the steps in this type of analysis, and one of its main features is the implementation of three families of tests to study independence given the intensities of the processes, which are not only useful to assess independence but also to identify factors causing dependence. The package also includes functions for generating different types of dependent point processes, and implements computational statistical inference tools using them. An application to characterize the dependence between the occurrence of extreme heat events in three Spanish locations using the package is shown.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"48 1","pages":"499"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90901395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Left censoring can occur relatively frequently when analysing recurrent events in epidemiological studies, especially observational ones. Concretely, the inclusion of individuals who were already at risk before the effective start of follow-up in a cohort study may leave the number of prior episodes they have already experienced unknown, and this easily leads to biased and inefficient estimates. The miRecSurv package is based on models with an episode-specific baseline hazard, with multiple imputation of the number of prior episodes, when unknown, by means of the COM-Poisson distribution, a very flexible count distribution that can handle over-, sub-, and equidispersion; a stratified model depending on whether or not the individual had previously been at risk; and the use of a frailty term. The usage of the package is illustrated by means of a real data example based on an occupational cohort study and a simulation study.
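As a concept sketch of the underlying Prentice-Williams-Peterson formulation (using the survival package, not the miRecSurv API; the toy data are hypothetical), a gap-time PWP model stratified by episode number with robust standard errors can be written as:

# Concept sketch (survival package, not the miRecSurv API): a PWP gap-time model
library(survival)
# One row per at-risk interval: id, start/stop on the follow-up scale, event
# indicator, episode number (stratum), and a covariate x.
d <- data.frame(
  id      = c(1, 1, 2, 2, 3),
  start   = c(0, 5, 0, 3, 0),
  stop    = c(5, 9, 3, 10, 7),
  event   = c(1, 0, 1, 1, 0),
  episode = c(1, 2, 1, 2, 1),
  x       = c(0, 0, 1, 1, 1))
# Gap-time PWP: time since the previous episode, baseline hazard stratified by
# episode number, robust (sandwich) variance through cluster(id).
fit <- coxph(Surv(stop - start, event) ~ x + strata(episode) + cluster(id), data = d)
summary(fit)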
{"title":"miRecSurv Package: Prentice-Williams-Peterson Models with Multiple Imputation of Unknown Number of Previous Episodes","authors":"D. Moriña, G. Hernández-Herrera, A. Navarro","doi":"10.32614/rj-2021-082","DOIUrl":"https://doi.org/10.32614/rj-2021-082","url":null,"abstract":"Left censoring can occur with relative frequency when analysing recurrent events in epidemiological studies, especially observational ones. Concretely, the inclusion of individuals that were already at risk before the effective initiation in a cohort study, may cause the unawareness of prior episodes that have already been experienced, and this will easily lead to biased and inefficient estimates. The miRecSurv package is based on the use of models with specific baseline hazard, with multiple imputation of the number of prior episodes when unknown by means of the COMPoisson distribution, a very flexible count distribution that can handle over-, suband equidispersion, with a stratified model depending on whether the individual had or had not previously been at risk, and the use of a frailty term. The usage of the package is illustrated by means of a real data example based on a occupational cohort study and a simulation study.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"36 1","pages":"321"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82942226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nora M. Villanueva, M. Sestelo, Luís Meira-Machado, J. Roca-Pardiñas
In many situations it is of interest to ascertain whether a set of curves can be grouped, especially when confronted with a considerable number of curves. This paper introduces an R package, known as clustcurv, for determining clusters of curves with an automatic selection of their number. The package can be used for determining groups in multiple survival curves as well as in multiple regression curves. Moreover, it can be used with large numbers of curves. An illustration of the use of clustcurv is provided, using both real data examples and artificial data.
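As a hedged sketch of the survival-curve use case (the function survclustcurves() and its arguments are recalled from the package and should be treated as assumptions; the data are simulated), grouping five survival curves might look like:

# Sketch: clustering multiple survival curves (assumed API, simulated data)
library(clustcurv)
set.seed(1)
n      <- 500
g      <- sample(1:5, n, replace = TRUE)                # five groups, one curve each
time   <- rexp(n, rate = ifelse(g <= 3, 0.10, 0.05))    # groups 1-3 share a hazard, 4-5 another
status <- rbinom(n, 1, 0.8)                             # event indicator (1 = observed)
# Assumed call: cluster the five survival curves, selecting the number of
# clusters automatically via bootstrap-based tests.
fit <- survclustcurves(time = time, status = status, x = g,
                       algorithm = "kmeans", nboot = 50)
fit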
{"title":"clustcurv: An R Package for Determining Groups in Multiple Curves","authors":"Nora M. Villanueva, M. Sestelo, Luís Meira-Machado, J. Roca-Pardiñas","doi":"10.32614/rj-2021-032","DOIUrl":"https://doi.org/10.32614/rj-2021-032","url":null,"abstract":"In many situations it could be interesting to ascertain whether groups of curves can be performed, especially when confronted with a considerable number of curves. This paper introduces an R package, known as clustcurv, for determining clusters of curves with an automatic selection of their number. The package can be used for determining groups in multiple survival curves as well as for multiple regression curves. Moreover, it can be used with large numbers of curves. An illustration of the use of clustcurv is provided, using both real data examples and artificial data.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"191 1","pages":"164"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77621929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Risk and Performance Estimators Standard Errors package RPESE implements a new method for computing accurate standard errors of risk and performance estimators when returns are serially dependent. The new method makes use of the representation of a risk or performance estimator as a summation of a time series of influence-function (IF) transformed returns, and computes estimator standard errors using a sophisticated method of estimating the spectral density at frequency zero of the time series of IF-transformed returns. Two additional packages used by RPESE are introduced, namely RPEIF which computes and provides graphical displays of the IF of risk and performance estimators, and RPEGLMEN which implements a regularized Gamma generalized linear model polynomial fit to the periodogram of the time series of the IF-transformed returns. A Monte Carlo study shows that the new method provides more accurate estimates of standard errors for risk and performance estimators compared to well-known alternative methods in the presence of serial correlation.
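As a package-agnostic illustration of the idea (base R only, not the RPESE API; RPESE itself estimates the spectral density at frequency zero by fitting a regularized Gamma GLM to the periodogram, whereas this sketch uses a simpler Bartlett-kernel long-run variance), consider the standard error of the sample mean of serially correlated returns, whose influence function is IF(r) = r - mu:

# Concept sketch (base R, not the RPESE API): SE of the mean under serial dependence
set.seed(7)
r <- as.numeric(arima.sim(model = list(ar = 0.5), n = 2000, sd = 0.01))  # dependent returns
u <- r - mean(r)                                    # IF-transformed series for the mean
L <- 20                                             # truncation lag
g <- drop(acf(u, lag.max = L, type = "covariance", plot = FALSE)$acf)
w <- 1 - (1:L) / (L + 1)                            # Bartlett weights
lrv <- g[1] + 2 * sum(w * g[-1])                    # long-run variance of the IF series
c(dependent = sqrt(lrv / length(u)),                # SE accounting for serial correlation
  iid       = sd(r) / sqrt(length(r)))              # naive iid SE, too small here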
{"title":"RPESE: Risk and Performance Estimators Standard Errors with Serially Dependent Data","authors":"A. Christidis, R. Martin","doi":"10.32614/rj-2021-106","DOIUrl":"https://doi.org/10.32614/rj-2021-106","url":null,"abstract":"The Risk and Performance Estimators Standard Errors package RPESE implements a new method for computing accurate standard errors of risk and performance estimators when returns are serially dependent. The new method makes use of the representation of a risk or performance estimator as a summation of a time series of influence-function (IF) transformed returns, and computes estimator standard errors using a sophisticated method of estimating the spectral density at frequency zero of the time series of IF-transformed returns. Two additional packages used by RPESE are introduced, namely RPEIF which computes and provides graphical displays of the IF of risk and performance estimators, and RPEGLMEN which implements a regularized Gamma generalized linear model polynomial fit to the periodogram of the time series of the IF-transformed returns. A Monte Carlo study shows that the new method provides more accurate estimates of standard errors for risk and performance estimators compared to well-known alternative methods in the presence of serial correlation.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"74 1","pages":"624"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82234110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R is not a programming language, and this produces the inherent dichotomy between analytics and software engineering. With the emergence of data science, the opportunity exists to bridge this gap, especially through teaching practices.

Genesis: How did we get here? The article "Software Engineering and R Programming: A Call to Action" summarizes the dichotomy between analytics and software engineering in the R ecosystem, provides examples where this leads to problems, and proposes what we as R users can do to bridge the gap.

Data Analytic Language. The fundamental basis of the dichotomy is inherent in the evolution of S and R: they are not programming languages, but they ended up being mistaken for such. S was designed to be a data analytic language: to turn ideas into software quickly and faithfully, often used in a "non-programming" style (Chambers, 1998). Its original goal was to enable statisticians to apply code written in programming languages (at the time mostly FORTRAN) to analyze data quickly and interactively, for some suitable definition of "interactive" at the time (Becker, 1994). The success of S, and then R, can be traced to the ability to perform data analysis by applying existing tools to data in creative ways. A data analysis is a quest: at every step we learn more about the data, which informs our decision about the next steps. Whether it is an exploratory data analysis leveraging graphics, computing statistics, or fitting models, the final goal is typically not known ahead of time; it is reached by an iterative process of applying tools that we as analysts think may lead us further (Tukey, 1977). It is important to note that this is exactly the opposite of software engineering, where there is a well-defined goal: a specification or desired outcome, which simply needs to be expressed in a way understandable to the computer.
{"title":"The R Quest: from Users to Developers","authors":"Simon Urbanek","doi":"10.32614/rj-2021-111","DOIUrl":"https://doi.org/10.32614/rj-2021-111","url":null,"abstract":"R is not a programming language, and this produces the inherent dichotomy between analytics and software engineering. With the emergence of data science, the opportunity exists to bridge this gap, especially through teaching practices. Genesis: How did we get here? The article “Software Engineering and R Programming: A Call to Action” summarizes the dichotomy between analytics and software engineering in the R ecosystem, provides examples where this leads to problems and proposes what we as R users can do to bridge the gap. Data Analytic Language The fundamental basis of the dichotomy is inherent in the evolution of S and R: they are not programming languages, but they ended up being mistaken for such. S was designed to be a data analytic language: to turn ideas into software quickly and faithfully, often used in “non-programming” style (Chambers, 1998). Its original goal was to enable the statisticians to apply code which was written in programming languages (at the time mostly FORTRAN) to analyze data quickly and interactively for some suitable definition of “interactive” at the time (Becker, 1994). The success of S and then R can be traced to the ability to perform data analysis by applying existing tools to data in creative ways. A data analysis is a quest at every step we learn more about the data which informs our decision about next steps. Whether it is an exploratory data analysis leveraging graphics or computing statistics or fitting models the final goal is typically not known ahead of time, it is obtained by an iterative process of applying tools that we as analysts think may lead us further (Tukey, 1977). It is important to note that this is exactly the opposite of software engineering where there is a well-defined goal: a specification or desired outcome, which simply needs to be expressed in a way understandable to the computer.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"475 1","pages":"697"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79938019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aviation data has become increasingly accessible to the public thanks to the adoption of technologies such as Automatic Dependent Surveillance-Broadcast (ADS-B) and Mode S, which provide aircraft information over publicly accessible radio channels. Furthermore, the OpenSky Network provides multiple public resources for accessing such air traffic data from a large network of ADS-B receivers. Here, we present openSkies, the first R package for processing public air traffic data. The package provides an interface to the OpenSky Network resources, standardized data structures to represent the different entities involved in air traffic data, and functionalities to analyze and visualize such data. Furthermore, the portability of the implemented data structures makes openSkies easily reusable by other packages, thereby laying the foundation of aviation data engineering in R.
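As a package-agnostic sketch of the kind of resource the package wraps (this queries the public OpenSky REST endpoint with httr and jsonlite rather than calling the openSkies API; the endpoint and its bounding-box parameters follow the OpenSky Network documentation, and anonymous access is rate-limited), current state vectors over a bounding box can be retrieved as follows:

# Concept sketch (httr + jsonlite, not the openSkies API): current state vectors
library(httr)
library(jsonlite)
resp <- GET("https://opensky-network.org/api/states/all",
            query = list(lamin = 45.8, lomin = 5.9, lamax = 47.8, lomax = 10.5))
states <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$states
# Each row describes one aircraft: icao24 address, callsign, position,
# altitude, velocity, and related fields.
head(states)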
{"title":"openSkies - Integration of Aviation Data into the R Ecosystem","authors":"Rafael Ayala, D. Ayala, L. S. Vidal, David Ruiz","doi":"10.32614/rj-2021-095","DOIUrl":"https://doi.org/10.32614/rj-2021-095","url":null,"abstract":"Aviation data has become increasingly more accessible to the public thanks to the adoption of technologies such as Automatic Dependent Surveillance-Broadcast (ADS-B) and Mode S, which provide aircraft information over publicly accessible radio channels. Furthermore, the OpenSky Network provides multiple public resources to access such air traffic data from a large network of ADS-B receivers. Here, we present openSkies , the first R package for processing public air traffic data. The package provides an interface to the OpenSky Network resources, standardized data structures to represent the different entities involved in air traffic data and functionalities to analyze and visualize such data. Furthermore, the portability of the implemented data structures makes openSkies easily reusable by other packages, therefore laying the foundation of aviation data engineering in R.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"1 1","pages":"485"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89877538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}