Background/aims: Clinical trials require numerous documents to be written: protocols, consent forms, clinical study reports, and many others. Large language models offer the potential to rapidly generate first-draft versions of these documents; however, there are concerns about the quality of their output. Here, we report an evaluation of how well large language models generate sections of one such document, the clinical trial protocol.
Methods: Using an off-the-shelf large language model, we generated protocol sections for a broad range of diseases and clinical trial phases. We assessed each of these sections across four dimensions: clinical thinking and logic; transparency and references; medical and clinical terminology; and content relevance and suitability. To improve performance, we used the retrieval-augmented generation method to enhance the large language model with accurate, up-to-date information, including regulatory guidance documents and data from ClinicalTrials.gov. Using this retrieval-augmented large language model, we regenerated the same protocol sections and assessed them across the same four dimensions.
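The retrieval-augmented generation step described above can be sketched in a few lines. This is only an illustration of the general technique, not the pipeline used in the study: the bag-of-words retriever, the three-passage corpus, and the helper names `retrieve` and `build_prompt` are all assumptions, and the call to the language model itself is deliberately left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Return the k passages most similar to the query (hypothetical retriever)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(q, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(task, passages):
    """Prepend numbered source passages so the model can ground and cite its draft."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Use only the numbered sources below and cite them.\n"
        f"{context}\n\nTask: {task}"
    )


# Toy corpus standing in for guidance documents and registry records (invented text).
corpus = [
    "Guidance: the protocol should define primary and secondary endpoints.",
    "Registry record: phase 2 trial in asthma, primary endpoint change in FEV1 at week 12.",
    "Cafeteria menu for Tuesday: soup and sandwiches.",
]
task = "Draft the endpoints section for a phase 2 asthma trial protocol."
top = retrieve(task, corpus, k=2)
prompt = build_prompt(task, top)  # this string would then be sent to the model
```

A production system would typically replace the word-count retriever with dense embeddings and a vector index; the structure (retrieve, then condition the generation on the retrieved text) stays the same.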
Results: We find that the off-the-shelf large language model delivers reasonable results, especially for content relevance and the correct use of medical and clinical terminology, with scores of over 80%. However, it shows limited performance in two dimensions, clinical thinking and logic, and transparency and references, with assessment scores of ≈40% or less. The use of retrieval-augmented generation substantially improves the writing quality of the large language model, with scores in both of these dimensions increasing to ≈80%. The retrieval-augmented generation method thus greatly improves the practical usability of large language models for clinical trial-related writing.
Discussion: Our results suggest that hybrid large language model architectures, such as the retrieval-augmented generation method we utilized, offer strong potential for clinical trial-related writing, including a wide variety of documents. This is potentially transformative, since it addresses several major bottlenecks of drug development.
Background/aims: Sample size determination for cluster randomised trials is challenging because it requires robust estimation of the intra-cluster correlation coefficient. Typically, the sample size is chosen to provide a certain level of power to reject the null hypothesis in a two-sample hypothesis test. This relies on the minimal clinically important difference and estimates for the overall standard deviation, the intra-cluster correlation coefficient and, if cluster sizes are assumed to be unequal, the coefficient of variation of the cluster size. Varying any of these parameters can have a strong effect on the required sample size. In particular, it is very sensitive to small differences in the intra-cluster correlation coefficient. A relevant intra-cluster correlation coefficient estimate is often not available, or the available estimate is imprecise due to being based on studies with low numbers of clusters. If the intra-cluster correlation coefficient value used in the power calculation is far from the unknown true value, this could lead to trials which are substantially over- or under-powered.
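The sensitivity to the intra-cluster correlation coefficient can be made concrete with a short calculation. The sketch below assumes a normal-approximation two-sample comparison of means and the commonly used design effect for unequal cluster sizes, 1 + ((1 + cv²)·m − 1)·ICC; the function name and all numerical inputs are hypothetical and are not taken from the trial discussed in the article.

```python
import math
from statistics import NormalDist


def clusters_per_arm(delta, sd, icc, mean_cluster_size, cv=0.0,
                     alpha=0.05, power=0.8):
    """Clusters per arm for a two-arm cluster randomised trial (normal approximation).

    delta: minimal clinically important difference
    sd:    overall standard deviation of the outcome
    icc:   intra-cluster correlation coefficient
    cv:    coefficient of variation of the cluster size
    """
    z = NormalDist().inv_cdf
    # Individuals per arm under individual randomisation.
    n_ind = 2 * (z(1 - alpha / 2) + z(power)) ** 2 * sd ** 2 / delta ** 2
    # Design effect allowing for unequal cluster sizes.
    deff = 1 + ((1 + cv ** 2) * mean_cluster_size - 1) * icc
    return math.ceil(n_ind * deff / mean_cluster_size)


# Hypothetical inputs: the same design evaluated at three plausible ICC values.
for icc in (0.02, 0.05, 0.10):
    print(icc, clusters_per_arm(delta=2, sd=8, icc=icc,
                                mean_cluster_size=20, cv=0.5))
```

Running the loop shows the required number of clusters rising steeply as the assumed ICC moves across values that are all plausible in practice, which is the problem a point-estimate power calculation cannot express.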
Methods: In this article, we propose a hybrid approach using Bayesian assurance to determine the sample size for a cluster randomised trial in combination with a frequentist analysis. Assurance is an alternative to traditional power, which incorporates the uncertainty on key parameters through a prior distribution. We suggest specifying prior distributions for the overall standard deviation, intra-cluster correlation coefficient and coefficient of variation of the cluster size, while still utilising the minimal clinically important difference. We illustrate the approach through the design of a cluster randomised trial in post-stroke incontinence and compare the results to those obtained from a standard power calculation.
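As a rough illustration of the idea, assurance can be approximated by averaging the frequentist power over draws from the prior on the nuisance parameters, with the treatment effect held at the minimal clinically important difference. The sketch below uses a normal-approximation power formula and invented priors (gamma for the standard deviation, beta for the intra-cluster correlation coefficient, uniform for the coefficient of variation); these choices and the function names are assumptions, not the elicited priors or the exact method of the article.

```python
import math
import random
from statistics import NormalDist

_NORMAL = NormalDist()


def power(k, m, delta, sd, icc, cv, alpha=0.05):
    """Approximate power with k clusters per arm of mean size m."""
    deff = 1 + ((1 + cv ** 2) * m - 1) * icc
    se = math.sqrt(2 * sd ** 2 * deff / (k * m))
    return _NORMAL.cdf(delta / se - _NORMAL.inv_cdf(1 - alpha / 2))


def assurance(k, m, delta, draw_prior, n_draws=5000, seed=1):
    """Monte Carlo average of power over the prior on (sd, icc, cv)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        sd, icc, cv = draw_prior(rng)
        total += power(k, m, delta, sd, icc, cv)
    return total / n_draws


def example_prior(rng):
    """Hypothetical priors, for illustration only."""
    sd = rng.gammavariate(64.0, 0.125)  # mean 8, standard deviation 1
    icc = rng.betavariate(2.0, 38.0)    # mean 0.05
    cv = rng.uniform(0.3, 0.7)
    return sd, icc, cv


def smallest_k(target, m, delta, draw_prior, k_max=500):
    """Smallest number of clusters per arm whose assurance reaches the target."""
    for k in range(2, k_max + 1):
        if assurance(k, m, delta, draw_prior) >= target:
            return k
    return None


k_needed = smallest_k(target=0.8, m=20, delta=2, draw_prior=example_prior)
```

Because the same seed is reused for every candidate `k`, the assurance estimates share their prior draws and increase monotonically with `k`, so the search stops at a well-defined smallest value.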
Results: We show that assurance can be used to calculate a sample size based on an elicited prior distribution for the intra-cluster correlation coefficient, whereas a power calculation discards all of the information in the prior except for a single point estimate. Results show that this approach can avoid misspecifying sample sizes when the prior medians for the intra-cluster correlation coefficient are very similar, but the underlying prior distributions exhibit quite different behaviour. Incorporating uncertainty on all three of the nuisance parameters, rather than only on the intra-cluster correlation coefficient, does not notably increase the required sample size.
Conclusion: Assurance provides a better understanding of the probability of success of a trial given a particular minimal clinically important difference and can be used instead of power to produce sample sizes that are more robust to parameter uncertainty. This is especially useful when there is difficulty obtaining reliable parameter estimates.
Background: Platform trials typically feature a shared control arm and multiple experimental treatment arms. Staggered entry and exit of arms splits the control group into two cohorts: those randomized during the same period in which the experimental arm was open (concurrent controls) and those randomized outside that period (nonconcurrent controls). Combining these control groups may offer increased statistical power but can lead to bias if analyses do not account for time trends in the response variable. Proposed methods of adjustment for time may increase type I error rates when time trends impact arms unequally or when large, sudden changes to the response rate occur. However, there has been limited exploration of the degree of type I error inflation one can plausibly expect in real-world scenarios.
Methods: We use data from the Adaptive COVID-19 Treatment Trial (ACTT) to mimic a realistic platform trial with a remdesivir control arm. We compare four strategies for estimating the effect of interferon beta-1a (the ACTT-3 experimental arm) relative to remdesivir (data from ACTT-1, ACTT-2, and ACTT-3) on recovery and death by day 29: utilizing concurrent controls only (the prespecified analysis), pooling all remdesivir arm data without adjustment (the "unadjusted-pooled" analysis), adjusting for time as a categorical variable, and a Bayesian hierarchical model implementation which adjusts for time trends using smoothing techniques (the "Bayesian time machine"). We compare type I error rates and relative efficiency of each method in simulation settings based on observed ACTT remdesivir arm data.
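To show why unadjusted pooling is risky, the following toy simulation compares the type I error rate of a concurrent-only analysis with that of an unadjusted-pooled analysis when the control response rate shifts between periods. It is a minimal sketch under stated assumptions: the response rates, sample sizes, and two-proportion z-test are invented for illustration, no ACTT data are used, and the categorical-time and Bayesian models are not implemented.

```python
import math
import random
from statistics import NormalDist


def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def binom(rng, n, p):
    """Number of successes in n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))


def type1_rates(rate_early, rate_late, n=300, n_sims=1000,
                alpha=0.05, seed=7):
    """Type I error of concurrent-only vs unadjusted-pooled analyses.

    The experimental arm has no true effect: it shares the late-period
    control rate. Early-period controls are nonconcurrent.
    """
    rng = random.Random(seed)
    reject_conc = reject_pool = 0
    for _ in range(n_sims):
        early_ctrl = binom(rng, n, rate_early)  # nonconcurrent controls
        late_ctrl = binom(rng, n, rate_late)    # concurrent controls
        late_trt = binom(rng, n, rate_late)     # experimental arm under the null
        reject_conc += two_prop_p(late_trt, n, late_ctrl, n) < alpha
        reject_pool += two_prop_p(late_trt, n,
                                  early_ctrl + late_ctrl, 2 * n) < alpha
    return reject_conc / n_sims, reject_pool / n_sims


# Hypothetical step change in the control response rate between periods.
concurrent_rate, pooled_rate = type1_rates(rate_early=0.6, rate_late=0.7)
```

With a shift in the control rate, the pooled analysis rejects the true null far more often than the nominal level, while the concurrent-only analysis stays near it; with no shift, both remain close to nominal.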
Results: The unadjusted-pooled approach provided substantially different estimates of the effect of interferon beta-1a relative to remdesivir compared with the concurrent-only and model-based approaches, indicating that changes in recovery and death rates over time were not ignorable across different stages of ACTT. The model-based approaches rely on an assumption of constant treatment effects for each arm in the platform relative to control; error rates more than doubled in settings where this was not satisfied. Relative efficiency of the model-based approaches compared with the concurrent-only analysis was moderate.
Conclusions: In simulation settings where key model assumptions were not met, potential efficiency gains from incorporation of nonconcurrent controls were outweighed by the risk of substantial type I error rate inflation. This leads us to advise against these strategies for primary analyses in confirmatory clinical trials, aligning with current FDA guidance advising against comparisons to nonconcurrent controls in COVID-19 settings. The model-based adjustment methods may be useful in other settings, but we recommend performing the concurrent-only analysis as a reference for assessing the degree to which nonconcurrent controls drive results.
There is growing interest in using embedded research methods, particularly pragmatic clinical trials, to address well-known evidentiary shortcomings afflicting the health care system. Reviews of pragmatic clinical trials published between 2014 and 2019 found that 8.8% were conducted with waivers of informed consent; furthermore, the number of trials in which consent is not obtained is increasing over time. From a regulatory perspective, waivers of informed consent are permissible when certain conditions are met, including that the study involves no more than minimal risk, that it could not practicably be carried out without a waiver, and that waiving consent does not violate participants' rights and welfare. Nevertheless, when research is conducted with a waiver of consent, several ethical challenges arise. We must consider how to address empirical evidence showing that patients and members of the public generally prefer prospective consent, demonstrate respect for persons using tools other than consent, promote public trust and investigator integrity, and ensure an adequate level of participant protections. In this article, we use examples drawn from real pragmatic clinical trials to argue that prospective consultation with representatives of the target study population can address, or at least mitigate, many of the ethical challenges posed by waivers of informed consent. We also consider what consultation might involve to illustrate its feasibility and address potential objections.
Background/aims: Including women of childbearing age in a clinical trial requires weighing two bioethical considerations: first, the lack of knowledge about the potential teratogenic effects of an investigational product, and second, the principle of justice, under which no population should be excluded from the benefits of research. The most common way to address this issue is to require volunteers to use contraceptives before, during, and for a few weeks after the clinical trial. This work presents the strategies used to promote contraception use and prevent pregnancy during the Alzheimer's Prevention Initiative Autosomal-Dominant Alzheimer's Disease (API ADAD) Colombia clinical trial. Two characteristics make this trial of special interest for the close monitoring of contraception use. One is that the trial lasted more than 7 years; the other is that participants could be carriers of the E280A PSEN1 mutation, which leads to mild cognitive impairment as early as their late 30s.
Methods: During the screening visit, each volunteer underwent an individual medical evaluation to select the contraception method that best suited her, with referral to a gynecologist when necessary. All non-surgical contraception methods were supplied by the sponsor. Staff were trained in contraception counseling, in correctly dispensing contraceptive drugs to volunteers, and in identifying, reporting, and following up on pregnancies. Two comprehensive educational campaigns on contraception use were conducted for all volunteers. In addition, volunteers were asked in an annual survey to evaluate the dispensing procedure. Finally, the effectiveness of these strategies was evaluated retrospectively by comparing, by extrapolation, the number of pregnancies that occurred during the trial with the General Fertility Rate in Colombia.
Results: A total of 159 female volunteers were recruited. All strategies were implemented as planned, even during the COVID-19 contingency. Ten pregnancies occurred during the evaluation period (2015-2021). Two were planned; the rest were associated with a potential therapeutic failure or incorrect use of contraceptive methods, giving a contraceptive failure rate of 0.49% per year. Sixty percent of pregnancies ended in abortion, either miscarriage or therapeutic abortion. However, there were not enough data to associate the pregnancy outcome with the administration of the investigational product. Finally, we observed a lower fertility rate in women participating in the trial than in the Colombian population.
Conclusion: The lower rates of contraceptive failure and the decrease in the incidence of pregnancies in women participating in the trial compared to the Colombian population across the 7 years of evaluation suggest that the strategies used in API ADAD Colombia were adequate and effective in addressing contraception use.
Background: Concerns about low accrual have long been a standard part of the discourse on cancer clinical trials, reaching even as far as the news media. Indeed, so many trials are closed before completing accrual that a cottage industry has recently developed creating statistical models to predict trial failure. We previously proposed four methodologic fixes for the current crisis in clinical trials: (1) dramatically reducing the number of eligibility criteria; (2) using data routinely collected in clinical practice for trial endpoints; and lowering barriers to accrual by (3) cluster randomization or (4) staged consent.
Methods: We report our practical experience of applying these fixes to randomized trials at Memorial Sloan Kettering Cancer Center.
Results: We have completed seven single-center randomized trials, with several more underway and accruing rapidly, with a total accrual approaching 10,000. Many of the trials have compared surgical interventions, an area where trials have traditionally been hard to complete. Only one of these trials was externally funded. While low costs were possible due to the existing research infrastructure at our institution, such infrastructure is common at major cancer centers.
Conclusions: Further research on innovative clinical trial designs is warranted, particularly in higher-stakes settings, and in trials of medical and radiotherapy interventions.
Background/aims: Self-reported questionnaires on health status after randomized trials can be time-consuming, costly, and potentially unreliable. Administrative data sets may provide cost-effective, less biased information, but it is uncertain how administrative and self-reported data compare in identifying chronic conditions in a New Zealand cohort. This study aimed to determine whether record linkage could replace self-reported questionnaires to identify chronic conditions that were the outcomes of interest for trial follow-up.
Methods: Participants in the 50-year follow-up of a randomized trial were asked to complete a questionnaire and to consent to access to their administrative data. The proportion of participants with diabetes, pre-diabetes, hyperlipidaemia, hypertension, mental health disorders, and asthma was calculated using each data source, and agreement between data sources was assessed.
Results: Participants had a mean age of 49 years (SD = 1, n = 424, 50% male). Agreement between questionnaire and administrative data was slight for pre-diabetes (kappa = 0.10), fair for hyperlipidaemia (kappa = 0.27), substantial for diabetes (kappa = 0.65), and moderate for other conditions (all kappa > 0.42). Administrative data alone identified two to three times more cases than the questionnaire for all outcomes except hypertension and mental health disorders, where the questionnaire alone identified one to two times more cases than administrative data. Combining all sources increased case detection for all outcomes.
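For readers unfamiliar with the agreement statistic, Cohen's kappa for two binary data sources can be computed from a 2×2 table as below. The descriptive labels follow the widely used Landis and Koch bands, which match the wording in the results; the cell counts in the example are invented and do not come from the study.

```python
def cohen_kappa(both, questionnaire_only, admin_only, neither):
    """Cohen's kappa for two binary classifications of the same participants."""
    n = both + questionnaire_only + admin_only + neither
    observed = (both + neither) / n
    p_quest = (both + questionnaire_only) / n
    p_admin = (both + admin_only) / n
    # Agreement expected by chance from the marginal proportions.
    expected = p_quest * p_admin + (1 - p_quest) * (1 - p_admin)
    if expected == 1:
        return 1.0  # both sources are constant and identical
    return (observed - expected) / (1 - expected)


def landis_koch(kappa):
    """Descriptive label for a kappa value (Landis and Koch bands)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical counts for one condition in a cohort of 100 participants.
k = cohen_kappa(both=40, questionnaire_only=10, admin_only=10, neither=40)
label = landis_koch(k)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why a rare condition can show high percentage agreement yet only slight or fair kappa.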
Conclusions: A combination of questionnaire, pharmaceutical, and laboratory data, together with expert panel review, was required to identify participants with chronic conditions of interest in this follow-up of a clinical trial.