Background: Patient recruitment for clinical trials remains a major challenge, with 86% of trials failing to meet enrollment targets on time. In over 77% of cases, recruitment difficulties stem from matching problems between trials and patients. Case-Based Reasoning (CBR) offers a distinct patient-to-patient approach by determining eligibility through comparison with previously enrolled patients, yet this methodology remains underexplored in contemporary oncology trial matching despite its potential advantages.
Objective: To compare the performance of two CBR approaches, random forest (RF) and target patient similarity (TPS), in predicting patient eligibility for recent oncology clinical trials using real-world electronic health record data.
Methods: We selected three breast cancer clinical trials (2019-2022) from our institutional registry. Patient data were extracted from our clinical data warehouse, including structured data (laboratory results, diagnosis codes, procedures, treatments) and unstructured clinical narratives processed with natural language processing. For each trial, we trained RF classifiers and TPS models using repeated hold-out validation (25 splits, 70/30 train-test). Performance was evaluated with discriminative metrics (AUC, positive precision, recall, F1-score) and ranking metrics (P@5, P@10, MAP, MRR, NDCG@5, NDCG@10). We also analyzed model performance as the number of eligible patients in the training set varied from 2 up to 70% of each trial's eligible cohort.
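The evaluation protocol above can be sketched as follows. This is an illustrative example, not the study's code: the synthetic data, feature count, and hyperparameters are assumptions, standing in for the EHR-derived features and imbalanced eligibility labels described in the Methods.

```python
# Hedged sketch: repeated hold-out evaluation (25 splits, 70/30) of an RF
# eligibility classifier, reporting AUC and precision-at-k on the ranking.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def precision_at_k(y_true, scores, k):
    """Fraction of the top-k ranked patients who are truly eligible."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.mean(np.asarray(y_true)[top_k]))

# Synthetic stand-in for EHR features; labels are imbalanced, mimicking
# the small eligible cohorts relative to the screened population.
X, y = make_classification(n_samples=1000, n_features=50,
                           weights=[0.95, 0.05], random_state=0)

aucs, p_at_5 = [], []
for split in range(25):  # 25 repeated 70/30 hold-out splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=split)
    clf = RandomForestClassifier(n_estimators=200, random_state=split)
    clf.fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]  # eligibility probability
    aucs.append(roc_auc_score(y_te, scores))
    p_at_5.append(precision_at_k(y_te, scores, k=5))

print(f"mean AUC = {np.mean(aucs):.3f}, mean P@5 = {np.mean(p_at_5):.3f}")
```

Stratified splitting preserves the eligible/ineligible ratio in each split, which matters when the eligible class is small.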
Results: Both approaches demonstrated strong discriminative performance across the three trials, with average AUCs of 84.1% for RF and 76.4% for TPS, driven primarily by high recall (82.3% and 77.7%, respectively). However, positive precision remained low (13.3% and 9.9%), reflecting high false-positive rates due to class imbalance. RF showed superior ranking performance, particularly for the trial with the largest eligible cohort (n = 542; P@5 = 78.6%, MRR = 88.0%), compared to TPS (P@5 = 47.9%, MRR = 69.2%). Both approaches reached performance plateaus with only around 10 eligible patients in the training set. Variable importance analysis revealed that treatment-related features, diagnosis codes, and procedures were consistently the most important predictors, with relevant patterns identified even with minimal training data.
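For readers less familiar with the ranking metrics reported above, a minimal sketch of MRR and NDCG@k for binary eligibility labels follows; the toy labels and scores are illustrative, not study data.

```python
# Hedged sketch: ranking metrics over a scored patient list, where
# y_true marks truly eligible patients (1) and scores are model outputs.
import numpy as np

def mrr(y_true, scores):
    """Reciprocal rank of the first truly eligible patient in the ranking."""
    order = np.argsort(scores)[::-1]
    ranked = np.asarray(y_true)[order]
    hits = np.flatnonzero(ranked == 1)
    return 1.0 / (hits[0] + 1) if hits.size else 0.0

def ndcg_at_k(y_true, scores, k):
    """NDCG@k with binary relevance: gains discounted by log2 of rank."""
    order = np.argsort(scores)[::-1][:k]
    gains = np.asarray(y_true, dtype=float)[order]
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    dcg = np.sum(gains * discounts)
    ideal = np.sort(np.asarray(y_true, dtype=float))[::-1][:gains.size]
    idcg = np.sum(ideal * discounts)
    return float(dcg / idcg) if idcg > 0 else 0.0

# Toy ranking: two eligible patients among six candidates.
y = [0, 1, 0, 0, 1, 0]
s = [0.2, 0.9, 0.4, 0.1, 0.7, 0.3]
print(mrr(y, s), ndcg_at_k(y, s, k=5))
```

MRR rewards placing at least one eligible patient near the top, which explains why it can remain high even when precision among all flagged patients is low.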
Conclusions: CBR approaches can effectively support patient pre-screening for oncology clinical trials, with RF demonstrating moderately superior performance over TPS. Both methods show robust discriminative performance with small training datasets, though ranking performance varies substantially across trials. Our findings suggest that CBR approaches may benefit from integration with query-based or prompt-based methods during early recruitment phases, when training data are scarce.

