Background: A 2×2 factorial design evaluates two interventions (A versus control and B versus control) by randomising to control, A-only, B-only, or both A and B together. Extended factorial designs are also possible (e.g. 3×3 or 2×2×2). Factorial designs often require fewer resources and participants than alternative randomised controlled trials, but they are not widely used. We identified several issues that investigators considering this design need to address before using it in a late-phase setting.
Methods: We surveyed journal articles published between 2000 and 2022 relating to the design of factorial randomised controlled trials. We identified issues to consider based on these articles and our personal experience.
Results: We identified clinical, practical, statistical and external issues that make factorial randomised controlled trials more desirable. Clinical issues are (1) interventions can easily be co-administered; (2) the risk of safety issues from co-administration, above the individual risks of the separate interventions, is low; (3) safety or efficacy data are wanted on the combination intervention; (4) the potential for interaction (e.g. the effect of A differing when B is administered) is low; (5) it is important to compare interventions with the other interventions balanced, rather than allowing randomised interventions to affect the choice of other interventions; (6) eligibility criteria for the different interventions are similar. Practical issues are (7) recruitment is not harmed by testing many interventions; (8) each intervention and its associated toxicities are unlikely to reduce either adherence to the other intervention or overall follow-up; (9) blinding is easy to implement or not required. Statistical issues are (10) a suitable scale of analysis can be identified; (11) adjustment for multiplicity is not required; (12) early stopping for efficacy or lack of benefit can be done effectively. External issues are (13) adequate funding is available and (14) the trial is not intended for licensing purposes. An overarching issue (15) is that the factorial design should give a lower sample size requirement than alternative designs. Across designs with varying non-adherence, retention, intervention effects and interaction effects, 2×2 factorial designs require a smaller sample size than a three-arm alternative when one intervention's effect in the presence of the other intervention is reduced by no more than 24%-48% compared with its effect in the absence of the other intervention.
Conclusions: Factorial designs are not widely used and should be considered more often, using our issues to consider as a guide. A low potential for interaction (at most small to modest) is key, for example where the interventions have different mechanisms of action or target different aspects of the disease being studied.
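The overarching sample size issue (15) can be illustrated with a back-of-envelope calculation. This is not the abstract's simulation study: it is a minimal sketch assuming a continuous outcome, a normal-approximation two-sample formula, 90% power, and a hypothetical standardised effect of 0.3 for each intervention. The sketch assumes an interaction that reduces A's effect by a fraction r when B is present, which dilutes the factorial marginal effect of A to roughly delta*(1 - r/2), because half of the A versus no-A comparison also receives B.

```python
from statistics import NormalDist

def n_two_sample(delta, sigma=1.0, alpha=0.05, power=0.9):
    """Per-group n for a two-sample z-test (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * (z * sigma / delta) ** 2

delta = 0.3  # hypothetical standardised effect of each intervention
for r in (0.0, 0.24, 0.37, 0.48):  # reduction of A's effect when B is present
    # 2x2 factorial: the marginal effect of A is diluted to delta*(1 - r/2)
    # because half of the A vs no-A comparison also receives B
    n_factorial = 2 * n_two_sample(delta * (1 - r / 2))
    # three-arm alternative (control, A, B): three equal arms, undiluted effect
    n_three_arm = 3 * n_two_sample(delta)
    print(f"r={r:.2f}: factorial N≈{n_factorial:.0f}, three-arm N≈{n_three_arm:.0f}")
```

Under these simplifications the break-even point is (1 - r/2)^2 = 2/3, i.e. r ≈ 37%, which falls inside the 24%-48% range reported above; the abstract's wider range reflects the additional non-adherence, retention and effect-size scenarios it varied.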
Introduction: Funders must make difficult decisions about which treatments to prioritize for randomized trials. Earlier research suggests that experts have no ability to predict which treatments will vindicate their promise. We tested whether a brief training module could improve experts' trial predictions.
Methods: We randomized a sample of breast cancer and hematology-oncology experts to the presence or absence of a feedback training module where experts predicted outcomes for five recently completed randomized controlled trials and received feedback on accuracy. Experts then predicted primary outcome attainment for a sample of ongoing randomized controlled trials. Prediction skill was assessed by Brier scores, which measure the average deviation between their predictions and actual outcomes. Secondary outcomes were discrimination (ability to distinguish between positive and non-positive trials) and calibration (higher predictions reflecting higher probability of trials being positive).
Results: A total of 148 experts (46 for breast cancer, 54 for leukemia, and 48 for lymphoma) were randomized between May and December 2017 and included in the analysis (1217 forecasts for 25 trials). Feedback did not improve prediction skill (mean Brier score for control: 0.22, 95% confidence interval = 0.20-0.24 vs feedback arm: 0.21, 95% confidence interval = 0.20-0.23; p = 0.51). Control and feedback arms showed similar discrimination (area under the curve = 0.70 vs 0.73, p = 0.24) and calibration (calibration index = 0.01 vs 0.01, p = 0.81). However, experts in both arms offered predictions that were significantly more accurate than uninformative forecasts of 50% (Brier score = 0.25).
Discussion: A short training module did not improve predictions for cancer trial results. However, expert communities showed unexpected ability to anticipate positive trials. Pre-registration record: https://aspredicted.org/4ka6r.pdf.
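The Brier score used as the primary outcome above is simply the mean squared deviation between probabilistic forecasts and binary outcomes, so an uninformative 50% forecast always scores 0.25. A minimal sketch with hypothetical forecasts and outcomes:

```python
def brier_score(predictions, outcomes):
    """Mean squared deviation between probabilistic forecasts and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)

outcomes = [1, 0, 0, 1, 0]             # hypothetical trial results (1 = positive)
expert = [0.8, 0.3, 0.2, 0.6, 0.4]     # hypothetical expert forecasts
uninformative = [0.5] * len(outcomes)  # always-50% forecast

print(brier_score(expert, outcomes))         # 0.098 — lower is better
print(brier_score(uninformative, outcomes))  # 0.25, regardless of the outcomes
```

Because 0 is a perfect score and 0.25 is chance-level, the observed arm means of 0.21-0.22 sit between chance and perfection, which is why the abstract describes the experts as significantly better than uninformative forecasts.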
Background: Pivotal evidence of efficacy of a new drug is typically generated by (at least) two clinical trials which independently provide statistically significant and mutually corroborating evidence of efficacy based on a primary endpoint. In this situation, showing drug effects on clinically important secondary objectives can be demanding in terms of sample size requirements. Statistically efficient methods to power for such endpoints while controlling the Type I error are needed.
Methods: We review existing strategies for establishing claims on important but sample size-intense secondary endpoints. We present new strategies based on combined data from two independent, identically designed and concurrent trials, controlling the Type I error at the submission level. We explain the methodology and provide three case studies.
Results: Different strategies have been used for establishing secondary claims. One new strategy, involving a protocol planned analysis of combined data across trials, and controlling the Type I error at the submission level, is particularly efficient. It has already been successfully used in support of label claims. Regulatory views on this strategy differ.
Conclusions: Inference on combined data across trials is a useful approach for generating pivotal evidence of efficacy for important but sample size-intense secondary endpoints. It requires careful preparation and regulatory discussion.
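One way to see why combining two identically designed trials is statistically efficient for a sample size-intense secondary endpoint is a simple weighted combination of the two trials' test statistics. This is only an illustration of pooled inference, not necessarily the specific analysis the article describes; the trial sizes and z-statistics below are hypothetical, and Stouffer weighting by the square root of sample size is used as a stand-in for a protocol-planned analysis of combined data.

```python
from math import sqrt
from statistics import NormalDist

def stouffer_combined_z(z_scores, ns):
    """Weight each trial's z-statistic by sqrt(sample size) and combine;
    for identically designed trials this approximates a pooled analysis."""
    num = sum(z * sqrt(n) for z, n in zip(z_scores, ns))
    return num / sqrt(sum(ns))

# hypothetical: each trial alone misses two-sided significance on the
# secondary endpoint (z = 1.5 and 1.6, both below 1.96)
z1, z2, n1, n2 = 1.5, 1.6, 400, 400
z_comb = stouffer_combined_z([z1, z2], [n1, n2])
p_comb = 2 * (1 - NormalDist().cdf(z_comb))
print(z_comb, p_comb)  # combined z ≈ 2.19, two-sided p ≈ 0.028
```

The combined analysis reaches significance where neither trial does alone, which is the efficiency gain; the Type I error must then be controlled at the submission level, since the two trials no longer provide independent replication for this endpoint.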
Background: Evidence-based methods for randomised controlled trial recruitment and retention are extremely valuable. Despite increased testing of these through studies within a trial, there remains limited high-certainty evidence for effective strategies. In addition, there has been little consideration as to whether recruitment interventions also have an impact on participant retention.
Methods: A systematic review was conducted. Studies were eligible if they were randomised controlled trials using a recruitment intervention and which also assessed the impact of this on retention at any time point. Searches were conducted through MEDLINE, EMBASE, Cochrane Library, and the Northern Ireland Hub for Trials Methodology Research SWAT Repository. Two independent reviewers screened the search results and extracted data for eligible studies using a piloted extraction form.
Results: A total of 7815 records were identified, resulting in 10 studies being included in the review. Most studies (n = 6, 60%) focussed on the information given to participants, with two (20%) focussing on incentives, and two focussing on trial design and recruiter interventions. Due to intervention heterogeneity, none of the interventions could be meta-analysed. Only one study found a statistically significant effect, for letters including a photograph (odds ratio: 5.40, 95% CI 1.12-26.15, p = 0.04).
Conclusion: Assessment of the impacts of recruitment strategies, evaluated in a SWAT, on retention of participants in the host trial remains limited. Assessment of the impact of recruitment interventions on retention is recommended to minimise future research costs and waste.
Background/aims: As oncology treatments evolve, classic assumptions of toxicity associated with cytotoxic agents may be less relevant, requiring new design strategies for trials intended to inform dosing of agents that may be administered beyond a set number of defined cycles. We describe the overall incidence of dose-limiting toxicities during and after cycle 1, the frequency of reporting subsequent cycle toxicities, and the impact of post-cycle 1 dose-limiting toxicities on conclusions drawn from oncology phase 1 clinical trials.
Methods: We conducted a systematic review of subsequent cycle toxicities in oncology phase 1 clinical trials published in the Journal of Clinical Oncology from 2000 to 2020. We used chi-square tests and multivariable logistic regression to identify predictors of reporting subsequent cycle toxicity data.
Results: From 2000 to 2020, we identified 489 articles reporting on therapeutic phase 1 clinical trials. Of these, 421 (86%) reported data regarding cycle 1 dose-limiting toxicities and 170 (35%) reported data on cycle 1 dose modifications. Of the trials that reported cycle 1 dose-limiting toxicities, the median percentage of patients who experienced them was 8.89%. Only 47 (9.6%) publications reported on post-cycle 1 dose-limiting toxicities and only 92 (19%) reported on dose modifications beyond cycle 1. Of the trials that reported post-cycle 1 dose-limiting toxicities, the median percentage of patients who experienced them was 14.8%. Among the 371 studies with a recommended phase 2 dose, 89% did not report whether post-cycle 1 toxicities affected the recommended phase 2 dose. A more recent year of publication was independently associated with reduced odds of reporting subsequent cycle toxicity.
Conclusion: Reporting of subsequent cycle toxicity is uncommon in oncology phase 1 clinical trial publications and has become less common over time. Guidelines for reporting of phase 1 oncology clinical trials should expand to include toxicity data beyond the first cycle.
Background/aims: We developed an observer disfigurement severity scale for neurofibromatosis type 1-related plexiform neurofibromas to assess change in plexiform neurofibroma-related disfigurement, and evaluated its feasibility, reliability, and validity.
Methods: Twenty-eight raters, divided into four cohorts based on neurofibromatosis type 1 familiarity and clinical experience, were shown photographs of children in a clinical trial (NCT01362803) at baseline and 1 year on selumetinib treatment for plexiform neurofibromas (n = 20) and of untreated participants with plexiform neurofibromas (n = 4). Raters, blinded to treatment and timepoint, completed the 0-10 disfigurement severity score for plexiform neurofibroma on each image (0 = not at all disfigured, 10 = very disfigured). Raters evaluated the ease of completing the scale, and a subset repeated the procedure to assess intra-rater reliability.
Results: Mean baseline disfigurement severity score for plexiform neurofibroma ratings were similar in the selumetinib group (6.23) and controls (6.38). The mean paired difference between pre- and on-treatment ratings was -1.01 (less disfigurement) in the selumetinib group and 0.09 in the control group (p = 0.005). For the disfigurement severity score for plexiform neurofibroma ratings, there was moderate-to-substantial agreement within rater cohorts (weighted kappa range = 0.46-0.66) and agreement between scores of the same raters at repeat sessions (p > 0.05). In the selumetinib group, change in disfigurement severity score for plexiform neurofibroma ratings was moderately correlated with change in plexiform neurofibroma volume with treatment (r = 0.60).
Conclusion: This study demonstrates that our observer-rated disfigurement severity score for plexiform neurofibroma was feasible, reliable, and documented improvement in disfigurement in participants with plexiform neurofibroma shrinkage. Prospective studies in larger samples are needed to validate this scale further.
Background/aims: Showing "similar efficacy" of a less intensive treatment typically requires a non-inferiority trial. Yet such trials may be challenging to design and conduct. In acute promyelocytic leukemia, great progress has been achieved with the introduction of targeted therapies, but toxicity remains a major clinical issue. There is a pressing need to show the favorable benefit/risk of less intensive treatment regimens.
Methods: We designed a clinical trial that uses generalized pairwise comparisons of five prioritized outcomes (alive and event-free at 2 years, grade 3/4 documented infections, differentiation syndrome, hepatotoxicity, and neuropathy) to confirm a favorable benefit/risk of a less intensive treatment regimen. We conducted simulations based on historical data and assumptions about the differences expected between the standard of care and the less intensive treatment regimen to calculate the sample size required to have high power to show a positive Net Treatment Benefit in favor of the less intensive treatment regimen.
Results: Across 10,000 simulations, average sample sizes of 260 to 300 patients are required for a trial using generalized pairwise comparisons to detect typical Net Treatment Benefits of 0.19 (interquartile range 0.14-0.23 for a sample size of 280). The Net Treatment Benefit is interpreted as a difference between the probability of doing better on the less intensive treatment regimen than on the standard of care, minus the probability of the opposite situation. A Net Treatment Benefit of 0.19 translates to a number needed to treat of about 5.3 patients (1/0.19 ≃ 5.3).
Conclusion: Generalized pairwise comparisons allow for simultaneous assessment of efficacy and safety, with priority given to the former. The sample size required would be of the order of 300 patients, as compared with more than 700 patients for a non-inferiority trial using a margin of 4% against the less intensive treatment regimen for the absolute difference in event-free survival at 2 years, as considered here.
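The mechanics of generalized pairwise comparisons can be sketched on toy data. This is an illustration only, not the trial's analysis: it uses two hypothetical prioritised outcomes (alive and event-free at 2 years, where 1 is favourable; grade 3/4 infection, where 0 is favourable) and made-up patients, whereas the trial prioritises five outcomes. Each treated patient is compared with each control patient on the first outcome, ties are broken by the next outcome, and the Net Treatment Benefit is (wins - losses) divided by the number of pairs.

```python
def pairwise_net_benefit(treated, control):
    """Generalized pairwise comparisons over prioritised outcomes: compare
    each treated/control pair on the first outcome, break ties with the
    next, and return (wins - losses) / total pairs."""
    # outcome 1: alive and event-free at 2 years (higher is better)
    # outcome 2: grade 3/4 infection (lower is better)
    higher_is_better = [True, False]
    wins = losses = 0
    for t in treated:
        for c in control:
            for k, better_high in enumerate(higher_is_better):
                if t[k] != c[k]:
                    if (t[k] > c[k]) == better_high:
                        wins += 1
                    else:
                        losses += 1
                    break  # first outcome that differs decides the pair
    return (wins - losses) / (len(treated) * len(control))

treated = [(1, 0), (1, 0), (1, 1), (0, 0)]  # hypothetical less intensive arm
control = [(1, 1), (0, 0), (1, 0), (0, 1)]  # hypothetical standard of care
nb = pairwise_net_benefit(treated, control)
print(nb)  # → 0.375
```

As in the abstract, the statistic reads as a difference in probabilities: here the treated patient does better in 9 of 16 pairs and worse in 3, giving a Net Treatment Benefit of 0.375, and 1/nb gives the corresponding number needed to treat (about 2.7 for this toy data, versus 1/0.19 ≈ 5.3 in the abstract).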