Sample Size Calculation for an Individual Stepped-Wedge Randomized Trial
Aude Allemang-Trivalle, Annabel Maruani, Bruno Giraudeau
In the individual stepped-wedge randomized trial (ISW-RT), subjects are allocated to sequences, each sequence being defined by a control period followed by an experimental period. The total follow-up time is the same for all sequences, but the duration of the control and experimental periods varies among sequences. To our knowledge, there is no validated sample size calculation formula for ISW-RTs, unlike for stepped-wedge cluster randomized trials (SW-CRTs). The objective of this study was to adapt the formula used for SW-CRTs to the case of individual randomization and to validate this adaptation in a Monte Carlo simulation study. The proposed sample size calculation formula for the ISW-RT design yielded satisfactory empirical power for most scenarios, except those with operating characteristic values near the boundary (i.e., the smallest possible number of periods, or a very high or very low autocorrelation coefficient). Overall, the results provide useful insights into sample size calculation for ISW-RTs.
{"title":"Sample Size Calculation for an Individual Stepped-Wedge Randomized Trial","authors":"Aude Allemang-Trivalle, Annabel Maruani, Bruno Giraudeau","doi":"10.1002/bimj.202300167","DOIUrl":"10.1002/bimj.202300167","url":null,"abstract":"<p>In the individual stepped-wedge randomized trial (ISW-RT), subjects are allocated to sequences, each sequence being defined by a control period followed by an experimental period. The total follow-up time is the same for all sequences, but the duration of the control and experimental periods varies among sequences. To our knowledge, there is no validated sample size calculation formula for ISW-RTs unlike stepped-wedge cluster randomized trials (SW-CRTs). The objective of this study was to adapt the formula used for SW-CRTs to the case of individual randomization and to validate this adaptation using a Monte Carlo simulation study. The proposed sample size calculation formula for an ISW-RT design yielded satisfactory empirical power for most scenarios except scenarios with operating characteristic values near the boundary (i.e., smallest possible number of periods, very high or very low autocorrelation coefficient). Overall, the results provide useful insights into the sample size calculation for ISW-RTs.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":"66 5","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300167","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141581614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control
Raphael O. Betschart, Cristian Riccio, Domingo Aguilera-Garcia, Stefan Blankenberg, Linlin Guo, Holger Moch, Dagmar Seidl, Hugo Solleder, Felix Thalén, Alexandre Thiéry, Raphael Twerenbold, Tanja Zeller, Martin Zoche, Andreas Ziegler
Rapid advances in high-throughput DNA sequencing technologies have enabled large-scale whole genome sequencing (WGS) studies. Before association analyses between phenotypes and genotypes can be performed, the raw sequence data need to be preprocessed and quality controlled (QC). Because many biostatisticians have not yet worked with WGS data, we first sketch Illumina's short-read sequencing technology. Second, we explain the general preprocessing pipeline for WGS studies. Third, we provide an overview of important QC metrics applied to WGS data: on the raw data, after mapping and alignment, after variant calling, and after multisample variant calling. Fourth, we illustrate the QC with data from the GENEtic SequencIng Study Hamburg–Davos (GENESIS-HD), a study involving more than 9000 human whole genomes. All samples were sequenced on an Illumina NovaSeq 6000 with an average coverage of 35× using a PCR-free protocol. For QC, one Genome in a Bottle (GIAB) trio was sequenced in four replicates, and one GIAB sample was successfully sequenced 70 times in different runs. Fifth, we provide empirical data on the compression of raw data using the DRAGEN original read archive (ORA). The most important quality metrics in the application were genetic similarity, sample cross-contamination, deviations from the expected Het/Hom ratio, relatedness, and coverage. The compression ratio of the raw files using DRAGEN ORA was 5.6:1, and compression time was linear in genome coverage. In summary, the preprocessing, joint calling, and QC of large WGS studies are feasible within a reasonable time, and efficient QC procedures are readily available.
{"title":"Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control","authors":"Raphael O. Betschart, Cristian Riccio, Domingo Aguilera-Garcia, Stefan Blankenberg, Linlin Guo, Holger Moch, Dagmar Seidl, Hugo Solleder, Felix Thalén, Alexandre Thiéry, Raphael Twerenbold, Tanja Zeller, Martin Zoche, Andreas Ziegler","doi":"10.1002/bimj.202300278","DOIUrl":"10.1002/bimj.202300278","url":null,"abstract":"<p>Rapid advances in high-throughput DNA sequencing technologies have enabled large-scale whole genome sequencing (WGS) studies. Before performing association analysis between phenotypes and genotypes, preprocessing and quality control (QC) of the raw sequence data need to be performed. Because many biostatisticians have not been working with WGS data so far, we first sketch Illumina's short-read sequencing technology. Second, we explain the general preprocessing pipeline for WGS studies. Third, we provide an overview of important QC metrics, which are applied to WGS data: on the raw data, after mapping and alignment, after variant calling, and after multisample variant calling. Fourth, we illustrate the QC with the data from the GENEtic SequencIng Study Hamburg–Davos (GENESIS-HD), a study involving more than 9000 human whole genomes. All samples were sequenced on an Illumina NovaSeq 6000 with an average coverage of 35× using a PCR-free protocol. For QC, one genome in a bottle (GIAB) trio was sequenced in four replicates, and one GIAB sample was successfully sequenced 70 times in different runs. Fifth, we provide empirical data on the compression of raw data using the DRAGEN original read archive (ORA). The most important quality metrics in the application were genetic similarity, sample cross-contamination, deviations from the expected Het/Hom ratio, relatedness, and coverage. The compression ratio of the raw files using DRAGEN ORA was 5.6:1, and compression time was linear by genome coverage. In summary, the preprocessing, joint calling, and QC of large WGS studies are feasible within a reasonable time, and efficient QC procedures are readily available.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":"66 5","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300278","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141581613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Functional Multivariable Logistic Regression With an Application to HIV Viral Suppression Prediction
Siyuan Guo, Jiajia Zhang, Yichao Wu, Alexander C. McLain, James W. Hardin, Bankole Olatosi, Xiaoming Li
Motivated by improving the prediction of human immunodeficiency virus (HIV) suppression status using electronic health records (EHR) data, we propose a functional multivariable logistic regression model that accounts for longitudinal binary and continuous processes simultaneously. Specifically, the longitudinal measurements of both binary and continuous variables are modeled by functional principal components analysis, and the corresponding functional principal component scores are used to build a logistic regression model for prediction. The longitudinal binary data are linked to underlying Gaussian processes. Estimation is carried out using penalized splines for both the longitudinal continuous and binary data. Group lasso is used to select longitudinal processes, and a multivariate functional principal components analysis is proposed to revise the functional principal component scores to account for correlation among processes. The method is evaluated in comprehensive simulation studies and then applied to predict viral suppression using EHR data for people living with HIV in South Carolina.
{"title":"Functional Multivariable Logistic Regression With an Application to HIV Viral Suppression Prediction","authors":"Siyuan Guo, Jiajia Zhang, Yichao Wu, Alexander C. McLain, James W. Hardin, Bankole Olatosi, Xiaoming Li","doi":"10.1002/bimj.202300081","DOIUrl":"10.1002/bimj.202300081","url":null,"abstract":"<p>Motivated by improving the prediction of the human immunodeficiency virus (HIV) suppression status using electronic health records (EHR) data, we propose a functional multivariable logistic regression model, which accounts for the longitudinal binary process and continuous process simultaneously. Specifically, the longitudinal measurements for either binary or continuous variables are modeled by functional principal components analysis, and their corresponding functional principal component scores are used to build a logistic regression model for prediction. The longitudinal binary data are linked to underlying Gaussian processes. The estimation is done using penalized spline for the longitudinal continuous and binary data. Group-lasso is used to select longitudinal processes, and the multivariate functional principal components analysis is proposed to revise functional principal component scores with the correlation. The method is evaluated via comprehensive simulation studies and then applied to predict viral suppression using EHR data for people living with HIV in South Carolina.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":"66 5","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300081","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141536065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Combining Partial True Discovery Guarantee Procedures
Ningning Xu, Aldo Solari, Jelle J. Goeman
Closed testing has recently been shown to be optimal for simultaneous true discovery proportion control. It is, however, challenging to construct true discovery guarantee procedures so that they focus power on feature sets chosen by users on the basis of their specific interest or expertise. We propose a procedure that allows users to target power on prespecified feature sets, that is, “focus sets.” The method also allows inference for feature sets chosen post hoc, that is, “nonfocus sets,” for which we deduce a true discovery lower confidence bound by interpolation. Our procedure is built from partial true discovery guarantee procedures combined with Holm's procedure and is a conservative shortcut to the closed testing procedure. A simulation study confirms that the statistical power of our method is relatively high for focus sets, at the cost of power for nonfocus sets, as desired. In addition, we investigate its power properties for sets with specific structures, for example, trees and directed acyclic graphs. We also compare our method with AdaFilter in the context of replicability analysis. The application of our method is illustrated with a gene ontology analysis of gene expression data.
{"title":"Combining Partial True Discovery Guarantee Procedures","authors":"Ningning Xu, Aldo Solari, Jelle J. Goeman","doi":"10.1002/bimj.202300075","DOIUrl":"10.1002/bimj.202300075","url":null,"abstract":"<p>Closed testing has recently been shown to be optimal for simultaneous true discovery proportion control. It is, however, challenging to construct true discovery guarantee procedures in such a way that it focuses power on some feature sets chosen by users based on their specific interest or expertise. We propose a procedure that allows users to target power on prespecified feature sets, that is, “focus sets.” Still, the method also allows inference for feature sets chosen post hoc, that is, “nonfocus sets,” for which we deduce a true discovery lower confidence bound by interpolation. Our procedure is built from partial true discovery guarantee procedures combined with Holm's procedure and is a conservative shortcut to the closed testing procedure. A simulation study confirms that the statistical power of our method is relatively high for focus sets, at the cost of power for nonfocus sets, as desired. In addition, we investigate its power property for sets with specific structures, for example, trees and directed acyclic graphs. We also compare our method with AdaFilter in the context of replicability analysis. The application of our method is illustrated with a gene ontology analysis in gene expression data.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":"66 5","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300075","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141494375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sören Budig, Klaus Jung, Mario Hasler, Frank Schaarschmidt
In biomedical research, the simultaneous inference of multiple binary endpoints may be of interest. In such cases, an appropriate multiplicity adjustment is required that controls the family-wise error rate, that is, the probability of making at least one incorrect test decision. In this paper, we investigate two approaches that perform single-step