
Latest articles in BMC Medical Research Methodology

Cohort retention in a pandemic response study: lessons from the SARS-CoV2 Immunity & Reinfection Evaluation (SIREN) study.
IF 3.9 | CAS Tier 3 (Medicine) | Q1 HEALTH CARE SCIENCES & SERVICES | Pub Date: 2025-01-30 | DOI: 10.1186/s12874-025-02469-6
Anna Howells, Katie Munro, Sarah Foulkes, Atiya Kamal, Jack Haywood, Sophie Russell, Dominic Sparkes, Erika Aquino, Jennie Evans, Dale Weston, Susan Hopkins, Jasmin Islam, Victoria Hall
Background: SIREN is a healthcare worker cohort study aiming to determine COVID-19 incidence, duration of immunity, and vaccine effectiveness across 135 NHS organisations in the four UK nations. Conducting an intensive prospective cohort study during a pandemic was challenging. We designed an evolving retention programme, informed by emerging evidence on best practice. This included applying a multifactorial approach and considering strategies for barrier reduction, community building, follow-up, and tracing. We used participant engagement tools underpinned by our Participant Involvement Panel (PIP); here we evaluate cohort retention over time and identify lessons learned.

Methods: A mixed-methods evaluation of cohort retention over 12- and 24-month follow-up (June 2020 - March 2023). We described cohort retention by demographics and site, using odds ratios from logistic regression. Withdrawal reasons during this period were collected by survey. We collected participant feedback via a cross-sectional online survey conducted October - November 2022, using a behavioural science approach. We conducted two focus groups with research teams in February 2023 and performed thematic analysis exploring cohort retention challenges and facilitators.

Results: 37,275 (84.7%) participants completed 12 months of follow-up. Of 14,772 participants extending their follow-up to 24 months, 12,635 (85.5%) completed it. Retention increased with age in both the 12-month (55-64 years vs < 25 years: OR = 2.50; 95% CI: 2.19-2.85; p < 0.001) and 24-month (> 65 years vs < 25 years: OR = 2.92; 95% CI: 1.78-4.88; p < 0.001) cohorts. Retention was highest in the Asian and Black ethnic groups compared with White in both the 12-month (OR = 1.38; 95% CI: 1.23-1.56; p < 0.001, and OR = 1.64; 95% CI: 1.30-2.08; p < 0.001) and 24-month (OR = 1.78; 95% CI: 1.42-2.25; p < 0.001, and OR = 2.12; 95% CI: 1.41-3.35; p < 0.001) cohorts. Among participants who withdrew, the median time in follow-up at withdrawal was 7 months (IQR: 4-10 months) in the 12-month cohort and 19 months (IQR: 16-22 months) in the 24-month cohort. The top three reasons for withdrawal were workload, leaving site employment, and medical reasons. Themes identified from the focus groups included: the need to monitor and understand participant motivation over time, the necessity of inclusive and comprehensive communication, the importance of acknowledging participant contributions, building collaboration with local research teams, and investing in the research team's skillset.

Conclusion: Participant retention in the SIREN study remained high over 24 months of intensive follow-up, demonstrating that large cohort studies are feasible as a pandemic research tool. Our evaluation suggests it is possible to maintain an engaged cohort of healthcare workers (HCWs) during an acute pandemic response phase. The insights gained from this population group are important.
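The retention odds ratios above come from logistic regression. As a rough illustration of how such estimates behave, an unadjusted odds ratio with a Wald 95% confidence interval can be computed directly from retention counts; the counts below are hypothetical, for illustration only, not SIREN data:

```python
import math

def odds_ratio_ci(retained_a, total_a, retained_b, total_b, z=1.96):
    """Unadjusted odds ratio of retention for group A vs group B,
    with a Wald 95% confidence interval on the log-odds scale."""
    a, b = retained_a, total_a - retained_a  # group A: retained / withdrawn
    c, d = retained_b, total_b - retained_b  # group B: retained / withdrawn
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)    # standard error of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts: 900/1000 retained in the older group vs 800/1000 in the younger.
or_, lo, hi = odds_ratio_ci(900, 1000, 800, 1000)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")  # OR = 2.25 (95% CI 1.74-2.91)
```

A fitted logistic regression, as used in the study, would additionally adjust for site and other demographics.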
Citations: 0
The Danish Drowning Cohort: Utstein-style data from fatal and non-fatal drowning incidents in Denmark.
IF 3.9 | CAS Tier 3 (Medicine) | Q1 HEALTH CARE SCIENCES & SERVICES | Pub Date: 2025-01-30 | DOI: 10.1186/s12874-025-02483-8
Niklas Breindahl, Kasper Bitzer, Oliver B Sørensen, Alexander Wildenschild, Signe A Wolthers, Tim Lindskou, Jacob Steinmetz, Stig N F Blomberg, Helle C Christensen

Background: Effective interventions to reduce drowning incidents require accurate and reliable data for scientific analysis. However, the lack of high-quality evidence and the variability in drowning terminology, definitions, and outcomes present significant challenges in assessing studies to inform drowning guidelines. Many drowning reports use inappropriate classifications for drowning incidents, which significantly contributes to the underreporting of drowning. In particular, non-fatal drowning incidents are underreported because many countries do not routinely collect this data.

The Danish Drowning Cohort: The Danish Drowning Cohort was established in 2016 to facilitate research to improve preventative, rescue, and treatment interventions to reduce the incidence, mortality, and morbidity of drowning. The Danish Drowning Cohort contains nationwide data on all fatal and non-fatal drowning incidents treated by the Danish Emergency Medical Services. Data are extracted from the Danish prehospital electronic medical record using a text-search algorithm (the Danish Drowning Formula) and a manual validation process. The WHO definition of drowning, supported by the clarification statement for non-fatal drowning, is used as the case definition to identify drowning. All drowning patients are included, including unwitnessed incidents, non-conveyed patients, patients declared dead prehospital, and patients with obvious clinical signs of irreversible death. This method allows syndromic surveillance and monitors a nationwide cohort of fatal and non-fatal drowning incidents in near-real time to inform future prevention strategies. The Danish Drowning Cohort complies with the Utstein-style drowning reporting guidelines. The 30-day mortality is obtained through the Civil Personal Register to differentiate between fatal and non-fatal drowning incidents. In addition to prehospital data, new data linkages with other Danish registries via the patient's civil registration number will enable the examination of various additional factors associated with drowning risk.

Conclusion: The Danish Drowning Cohort contains nationwide prehospital data on all fatal and non-fatal drowning incidents treated by the Danish Emergency Medical Services. It forms the basis for research on drowning in Denmark and may improve preventative, rescue, and treatment interventions to reduce the incidence, mortality, and morbidity of drowning.
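The extraction pipeline described above pairs a text-search algorithm with manual validation. A minimal sketch of the flagging step might look like the following; the real Danish Drowning Formula operates on Danish-language records and its actual keyword set is not published here, so the pattern below is an illustrative English stand-in:

```python
import re

# Illustrative stand-in for a drowning text-search pattern; the actual
# Danish Drowning Formula uses Danish-language terms and a richer keyword set.
DROWNING_PATTERN = re.compile(
    r"\b(drown\w*|submersion|submerged in water|pulled from the water)\b",
    re.IGNORECASE,
)

def flag_for_review(records):
    """Return ids of records whose free text matches the search pattern;
    in the real pipeline, flagged records then undergo manual validation."""
    return [r["id"] for r in records if DROWNING_PATTERN.search(r["note"])]

records = [
    {"id": 1, "note": "Patient submerged in water at harbour, CPR started."},
    {"id": 2, "note": "Fall from bicycle, superficial abrasions."},
    {"id": 3, "note": "Near-drowning incident, conveyed to hospital."},
]
print(flag_for_review(records))  # [1, 3]
```

A high-recall pattern plus manual validation, as the cohort uses, trades extra review workload for fewer missed incidents.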

Citations: 0
Propensity Score Matching: should we use it in designing observational studies?
IF 3.9 | CAS Tier 3 (Medicine) | Q1 HEALTH CARE SCIENCES & SERVICES | Pub Date: 2025-01-29 | DOI: 10.1186/s12874-025-02481-w
Fei Wan

Background: Propensity Score Matching (PSM) stands as a widely embraced method in comparative effectiveness research. PSM crafts matched datasets, mimicking some attributes of randomized designs, from observational data. In a valid PSM design where all baseline confounders are measured and matched, the confounders would be balanced, allowing the treatment status to be considered as if it were randomly assigned. Nevertheless, recent research has unveiled a different facet of PSM, termed "the PSM paradox". As PSM approaches exact matching by progressively pruning matched sets in order of decreasing propensity score distance, it can paradoxically lead to greater covariate imbalance, heightened model dependence, and increased bias, contrary to its intended purpose.

Methods: We used analytic formulas, simulation, and the literature to demonstrate that this paradox stems from the misuse of metrics for assessing chance imbalance and bias.

Results: Firstly, matched pairs typically exhibit different covariate values despite having identical propensity scores. However, this disparity is a "chance" difference and will average to zero over a large number of matched pairs. Common distance metrics cannot capture this "chance" nature of covariate imbalance; instead, they reflect the increasing variability of chance imbalance as units are pruned and the sample size diminishes. Secondly, the largest estimate among numerous fitted models (fitted because researchers are uncertain which model is correct) was used to determine statistical bias. This cherry-picking procedure ignores the most significant benefit of a matching design: reducing model dependence through robustness against model misspecification bias.

Conclusions: We conclude that the PSM paradox is not a legitimate concern and should not stop researchers from using PSM designs.
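For readers unfamiliar with the design under debate, 1:1 greedy nearest-neighbour matching on the propensity score, with a caliper that prunes poor matches, can be sketched as follows. The data are synthetic, and the true propensity score is used in place of an estimated one for brevity; in practice it would be estimated, e.g. by logistic regression of treatment on the baseline confounders:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)              # a single baseline confounder
ps = 1 / (1 + np.exp(-x))           # propensity score (true, for simplicity)
treated = rng.random(n) < ps        # confounded treatment assignment

def greedy_match(ps_treated, ps_control, caliper=0.05):
    """Pair each treated unit with the nearest unused control whose
    propensity score lies within the caliper; unmatched units are pruned."""
    used = np.zeros(len(ps_control), dtype=bool)
    pairs = []
    for i in np.argsort(ps_treated):
        d = np.abs(ps_control - ps_treated[i])
        d[used] = np.inf            # each control may be used only once
        j = int(np.argmin(d))
        if d[j] <= caliper:
            pairs.append((i, j))
            used[j] = True
    return pairs

pairs = greedy_match(ps[treated], ps[~treated])
xt, xc = x[treated], x[~treated]
diff_before = xt.mean() - xc.mean()
diff_after = np.mean([xt[i] - xc[j] for i, j in pairs])
print(f"mean difference in x: {diff_before:.3f} before vs {diff_after:.3f} after matching")
```

In this sketch, matching shrinks the mean confounder difference between groups; the "paradox" debate concerns what happens to pair-level imbalance metrics as the caliper tightens and more units are pruned.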

Citations: 0
Artificial intelligence methods applied to longitudinal data from electronic health records for prediction of cancer: a scoping review.
IF 3.9 | CAS Tier 3 (Medicine) | Q1 HEALTH CARE SCIENCES & SERVICES | Pub Date: 2025-01-28 | DOI: 10.1186/s12874-025-02473-w
Victoria Moglia, Owen Johnson, Gordon Cook, Marc de Kamps, Lesley Smith

Background: Early detection and diagnosis of cancer are vital to improving outcomes for patients. Artificial intelligence (AI) models have shown promise in the early detection and diagnosis of cancer, but there is limited evidence on methods that fully exploit the longitudinal data stored within electronic health records (EHRs). This review aims to summarise methods currently utilised for prediction of cancer from longitudinal data and provides recommendations on how such models should be developed.

Methods: The review was conducted following PRISMA-ScR guidance. Six databases (MEDLINE, EMBASE, Web of Science, IEEE Xplore, PubMed and SCOPUS) were searched for relevant records published before 2 February 2024. Search terms related to the concepts "artificial intelligence", "prediction", "health records", "longitudinal", and "cancer". Data were extracted on: (1) publication details, (2) study characteristics, (3) input data, (4) model characteristics, (5) reproducibility, and (6) quality assessment using the PROBAST tool. Models were evaluated against a framework for terminology relating to the reporting of cancer detection and risk prediction models.

Results: Of 653 records screened, 33 were included in the review; 10 predicted risk of cancer, 18 performed cancer detection or early detection, 4 predicted recurrence, and 1 predicted metastasis. The most common cancers predicted were colorectal (n = 9) and pancreatic (n = 9). Sixteen studies used feature engineering to represent temporal data, with the most common features representing trends. Eighteen used deep learning models that take a direct sequential input, most commonly recurrent neural networks, but also convolutional neural networks and transformers. Prediction windows and lead times varied greatly between studies, even for models predicting the same cancer. A high risk of bias was found in 90% of the studies, often introduced by inappropriate study design (n = 26) and sample size (n = 26).

Conclusion: This review highlights the breadth of approaches to cancer prediction from longitudinal data. We identify areas where the reporting of methods could be improved, particularly regarding where in a patient's trajectory the model is applied. The review shows opportunities for further work, including comparison of these approaches and their applications in other cancers.

Citations: 0
Scalable information extraction from free text electronic health records using large language models.
IF 3.9 | CAS Tier 3 (Medicine) | Q1 HEALTH CARE SCIENCES & SERVICES | Pub Date: 2025-01-28 | DOI: 10.1186/s12874-025-02470-z
Bowen Gu, Vivian Shao, Ziqian Liao, Valentina Carducci, Santiago Romero Brufau, Jie Yang, Rishi J Desai

Background: A vast amount of potentially useful information such as description of patient symptoms, family, and social history is recorded as free-text notes in electronic health records (EHRs) but is difficult to reliably extract at scale, limiting their utility in research. This study aims to assess whether an "out of the box" implementation of open-source large language models (LLMs) without any fine-tuning can accurately extract social determinants of health (SDoH) data from free-text clinical notes.

Methods: We conducted a cross-sectional study using EHR data from the Mass General Brigham (MGB) system, analyzing free-text notes for SDoH information. We selected a random sample of 200 patients and manually labeled nine SDoH aspects. Eight advanced open-source LLMs were evaluated against a baseline pattern-matching model. Two human reviewers provided the manual labels, achieving 93% inter-annotator agreement. LLM performance was assessed using accuracy metrics for overall, mentioned, and non-mentioned SDoH, and macro F1 scores.

Results: LLMs outperformed the baseline pattern-matching approach, particularly for explicitly mentioned SDoH, achieving up to 40% higher accuracy on mentioned SDoH (Accuracy_mentioned). openchat_3.5 was the best-performing model, surpassing the baseline in overall accuracy across all nine SDoH aspects. A refined pipeline with prompt engineering reduced hallucinations and improved accuracy.

Conclusions: Open-source LLMs are effective and scalable tools for extracting SDoH from unstructured EHRs, surpassing traditional pattern-matching methods. Further refinement and domain-specific training could enhance their utility in clinical research and predictive analytics, improving healthcare outcomes and addressing health disparities.
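The pipeline the study describes, prompting an LLM per SDoH aspect and parsing a structured answer, can be sketched roughly as below. The prompt wording, aspect schema, and model interface are illustrative assumptions, not the study's actual prompts, and the LLM is stubbed so the example is self-contained; a real pipeline would call a local open-source model such as openchat_3.5:

```python
import json

# Hypothetical zero-shot prompt for one SDoH aspect (housing status).
PROMPT_TEMPLATE = (
    "From the clinical note below, report the patient's housing status as JSON "
    '{{"housing": "stable" | "unstable" | "not mentioned"}}.\n\nNote:\n{note}'
)

def extract_housing(note, llm):
    """Call an LLM (any callable str -> str) and parse its JSON answer,
    falling back to 'not mentioned' on malformed output."""
    raw = llm(PROMPT_TEMPLATE.format(note=note))
    try:
        return json.loads(raw).get("housing", "not mentioned")
    except json.JSONDecodeError:
        return "not mentioned"

# Stub LLM so the example runs without a model server.
def stub_llm(prompt):
    if "shelter" in prompt:
        return '{"housing": "unstable"}'
    return '{"housing": "not mentioned"}'

print(extract_housing("Patient currently staying at a homeless shelter.", stub_llm))
# prints "unstable"
```

Constraining the model to a fixed JSON schema, and treating unparseable output as "not mentioned", is one simple way to limit the hallucinations the study's prompt engineering addressed.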

Citations: 0
Time and cost of linking administrative datasets for outcomes assessment in a follow-up study of participants from two randomised trials.
IF 3.9 | CAS Tier 3 (Medicine) | Q1 HEALTH CARE SCIENCES & SERVICES | Pub Date: 2025-01-27 | DOI: 10.1186/s12874-025-02458-9
Mohammad Shahbaz, Jane E Harding, Barry Milne, Anthony Walters, Lisa Underwood, Martin von Randow, Lena Jacob, Greg D Gamble

Background: For the follow-up of participants in randomised trials, data linkage is thought to be a more cost-efficient method of assessing outcomes. However, researchers often encounter technical and budgetary challenges: data requests often require a significant amount of information from researchers and can take several years to process. This study aimed to determine the feasibility, direct costs, and total time required to access administrative datasets for assessment of outcomes in a follow-up study of two randomised trials.

Methods: We applied to access administrative datasets from New Zealand government agencies. All actions of study team members, along with their corresponding dates, were recorded prospectively for accessing data from each agency. Team members estimated the average time they spent on each action, and invoices from agencies were recorded. Additionally, we compared the estimated costs and time required for data linkage with those for obtaining self-reported questionnaires and conducting in-person assessments.

Results: Eight agencies were approached to supply data, of which seven gave approval. The time from first enquiry to receiving an initial dataset ranged from 96 to 854 days. For 859 participants, the estimated time required to obtain outcome data from agencies was 1,530 min; to obtain completed self-reported questionnaires was 11,025 min; and to complete in-person assessments was 77,310 min. The estimated total costs were 20,827 NZD for data linkage, 11,735 NZD for self-reported questionnaires, and 116,085 NZD for in-person assessments. Using this data, we estimate that for a cohort of 100 participants, the costs would be similar for data linkage and in-person assessments. For a cohort of 5,000 participants, we estimate that costs would be similar for data linkage and questionnaires, but ten-fold higher for in-person assessments.

Conclusions: Obtaining administrative datasets demands a substantial amount of time and effort. However, data linkage is a feasible method for outcome ascertainment in follow-up studies in New Zealand. For large cohorts, data linkage is likely to be less costly, whereas for small cohorts, in-person assessment has similar costs but is likely to be faster and allows direct assessment of outcomes.
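For a quick sense of scale, the reported totals can be reduced to per-participant figures. The minimal sketch below does only that division; the dictionary layout and rounding are ours, while the totals and the 859-participant denominator come from the abstract:

```python
# Per-participant figures derived from the reported totals
# (859 participants; NZD costs and staff minutes from the abstract).
methods = {
    "data_linkage":   {"cost_nzd": 20_827,  "minutes": 1_530},
    "questionnaires": {"cost_nzd": 11_735,  "minutes": 11_025},
    "in_person":      {"cost_nzd": 116_085, "minutes": 77_310},
}

N = 859
per_participant = {
    name: {
        "cost_nzd": round(v["cost_nzd"] / N, 2),
        "minutes": round(v["minutes"] / N, 1),
    }
    for name, v in methods.items()
}

for name, v in per_participant.items():
    print(f"{name}: ~{v['cost_nzd']} NZD and ~{v['minutes']} min per participant")
```

The contrast in marginal cost per participant (roughly 24 NZD for linkage versus 135 NZD in person) is what drives the cross-over the authors describe between small and large cohorts.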

{"title":"Time and cost of linking administrative datasets for outcomes assessment in a follow-up study of participants from two randomised trials.","authors":"Mohammad Shahbaz, Jane E Harding, Barry Milne, Anthony Walters, Lisa Underwood, Martin von Randow, Lena Jacob, Greg D Gamble","doi":"10.1186/s12874-025-02458-9","DOIUrl":"10.1186/s12874-025-02458-9","url":null,"abstract":"<p><strong>Background: </strong>For the follow-up of participants in randomised trials, data linkage is thought a more cost-efficient method for assessing outcomes. However, researchers often encounter technical and budgetary challenges. Data requests often require a significant amount of information from researchers, and can take several years to process. This study aimed to determine the feasibility, direct costs and the total time required to access administrative datasets for assessment of outcomes in a follow-up study of two randomised trials.</p><p><strong>Methods: </strong>We applied to access administrative datasets from New Zealand government agencies. All actions of study team members, along with their corresponding dates, were recorded prospectively for accessing data from each agency. Team members estimated the average time they spent on each action, and invoices from agencies were recorded. Additionally, we compared the estimated costs and time required for data linkage with those for obtaining self-reported questionnaires and conducting in-person assessments.</p><p><strong>Results: </strong>Eight agencies were approached to supply data, of which seven gave approval. The time from first enquiry to receiving an initial dataset ranged from 96 to 854 days. For 859 participants, the estimated time required to obtain outcome data from agencies was 1,530 min; to obtain completed self-reported questionnaires was 11,025 min; and to complete in-person assessments was 77,310 min. 
The estimated total costs were 20,827 NZD for data linkage, 11,735 NZD for self-reported questionnaires, and 116,085 NZD for in-person assessments. Using this data, we estimate that for a cohort of 100 participants, the costs would be similar for data linkage and in-person assessments. For a cohort of 5,000 participants, we estimate that costs would be similar for data linkage and questionnaires, but ten-fold higher for in-person assessments.</p><p><strong>Conclusions: </strong>Obtaining administrative datasets demands a substantial amount of time and effort. However, data linkage is a feasible method for outcome ascertainment in follow-up studies in New Zealand. For large cohorts, data linkage is likely to be less costly, whereas for small cohorts, in-person assessment has similar costs but is likely to be faster and allows direct assessment of outcomes.</p>","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":"25 1","pages":"21"},"PeriodicalIF":3.9,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11771019/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143051440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Penalized landmark supermodels (penLM) for dynamic prediction for time-to-event outcomes in high-dimensional data.
IF 3.9 Tier 3 (Medicine) Q1 HEALTH CARE SCIENCES & SERVICES Pub Date: 2025-01-27 DOI: 10.1186/s12874-024-02418-9
Anya H Fries, Eunji Choi, Summer S Han

Background: To effectively monitor long-term outcomes among cancer patients, it is critical to accurately assess patients' dynamic prognosis, which often involves utilizing multiple data sources (e.g., tumor registries, treatment histories, and patient-reported outcomes). However, challenges arise in selecting features to predict patient outcomes from high-dimensional data, aligning longitudinal measurements from multiple sources, and evaluating dynamic model performance.

Methods: We provide a framework for dynamic risk prediction using the penalized landmark supermodel (penLM) and develop novel metrics ([Formula: see text] and [Formula: see text]) to evaluate and summarize model performance across different timepoints. Through simulations, we assess the coverage of the proposed metrics' confidence intervals under various scenarios. We applied penLM to predict the updated 5-year risk of lung cancer mortality at diagnosis and for subsequent years by combining data from SEER registries (2007-2018), Medicare claims (2007-2018), Medicare Health Outcome Survey (2006-2018), and U.S. Census (1990-2010).

Results: The simulations confirmed valid coverage (~ 95%) of the confidence intervals of the proposed summary metrics. Of 4,670 lung cancer patients, 41.5% died from lung cancer. Using penLM, the key features to predict lung cancer mortality included long-term lung cancer treatments, minority races, regions with low education attainment or racial segregation, and various patient-reported outcomes beyond cancer staging and tumor characteristics. When evaluated using the proposed metrics, the penLM model developed using multi-source data ([Formula: see text]of 0.77 [95% confidence interval: 0.74-0.79]) outperformed those developed using single-source data ([Formula: see text]range: 0.50-0.74).

Conclusions: The proposed penLM framework with novel evaluation metrics offers effective dynamic risk prediction when leveraging high-dimensional multi-source longitudinal data.
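The landmarking step underlying a supermodel such as penLM can be sketched as follows. This toy Python version only builds the stacked landmark dataset; the column names, landmark times, and toy data are our assumptions, and it does not reproduce the paper's penalized fitting or its evaluation metrics:

```python
import pandas as pd

def build_landmark_superdataset(df, landmarks, horizon):
    """Stack one landmark dataset per landmark time s: keep subjects still
    at risk at s, administratively censor follow-up at s + horizon, and
    record s so a single ('super') model can smooth over landmarks."""
    stacked = []
    for s in landmarks:
        at_risk = df[df["time"] > s].copy()
        at_risk["landmark"] = s
        # administratively censor events occurring after the horizon s + w
        over = at_risk["time"] > s + horizon
        at_risk.loc[over, "event"] = 0
        at_risk["time"] = at_risk["time"].clip(upper=s + horizon)
        stacked.append(at_risk)
    return pd.concat(stacked, ignore_index=True)

# Toy data: three subjects with event/censoring times in years.
toy = pd.DataFrame({"id": [1, 2, 3], "time": [2.0, 6.0, 9.0], "event": [1, 1, 0]})
super_df = build_landmark_superdataset(toy, landmarks=[0, 3], horizon=5)
```

In penLM, a penalized survival model would then be fitted once to `super_df`, with the `landmark` column entering as a covariate (possibly interacted with predictors) to let effects vary over landmark time.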

{"title":"Penalized landmark supermodels (penLM) for dynamic prediction for time-to-event outcomes in high-dimensional data.","authors":"Anya H Fries, Eunji Choi, Summer S Han","doi":"10.1186/s12874-024-02418-9","DOIUrl":"10.1186/s12874-024-02418-9","url":null,"abstract":"<p><strong>Background: </strong>To effectively monitor long-term outcomes among cancer patients, it is critical to accurately assess patients' dynamic prognosis, which often involves utilizing multiple data sources (e.g., tumor registries, treatment histories, and patient-reported outcomes). However, challenges arise in selecting features to predict patient outcomes from high-dimensional data, aligning longitudinal measurements from multiple sources, and evaluating dynamic model performance.</p><p><strong>Methods: </strong>We provide a framework for dynamic risk prediction using the penalized landmark supermodel (penLM) and develop novel metrics ([Formula: see text] and [Formula: see text]) to evaluate and summarize model performance across different timepoints. Through simulations, we assess the coverage of the proposed metrics' confidence intervals under various scenarios. We applied penLM to predict the updated 5-year risk of lung cancer mortality at diagnosis and for subsequent years by combining data from SEER registries (2007-2018), Medicare claims (2007-2018), Medicare Health Outcome Survey (2006-2018), and U.S. Census (1990-2010).</p><p><strong>Results: </strong>The simulations confirmed valid coverage (~ 95%) of the confidence intervals of the proposed summary metrics. Of 4,670 lung cancer patients, 41.5% died from lung cancer. Using penLM, the key features to predict lung cancer mortality included long-term lung cancer treatments, minority races, regions with low education attainment or racial segregation, and various patient-reported outcomes beyond cancer staging and tumor characteristics. 
When evaluated using the proposed metrics, the penLM model developed using multi-source data ([Formula: see text]of 0.77 [95% confidence interval: 0.74-0.79]) outperformed those developed using single-source data ([Formula: see text]range: 0.50-0.74).</p><p><strong>Conclusions: </strong>The proposed penLM framework with novel evaluation metrics offers effective dynamic risk prediction when leveraging high-dimensional multi-source longitudinal data.</p>","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":"25 1","pages":"22"},"PeriodicalIF":3.9,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11771018/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143051531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SQUARE-IT: a proposed approach to square the identified research problem in the literature with the objectives, the appropriate clinical research question, and the research hypothesis.
IF 3.9 Tier 3 (Medicine) Q1 HEALTH CARE SCIENCES & SERVICES Pub Date: 2025-01-27 DOI: 10.1186/s12874-025-02468-7
Martin Alfuth, Jonas Klemp, Annette Schmidt, Lukas Streese, Nikolai Ramadanov, Robert Prill

The purpose of this article is to design and introduce the SQUARE-IT approach, which helps scientists and clinicians align important research problems with their objectives, the appropriate clinical research questions to be answered, and the research hypotheses to be investigated in medical and therapeutic specialties. Research ideas can be generated through simple methods such as brainstorming and mind mapping. However, transforming an idea into a valid research question is not as easy as it may seem: the mere presence of an idea does not guarantee that the researcher has already uncovered existing knowledge on a particular topic or identified the actual research problem. The SQUARE-IT items are therefore developed, described, and critically discussed with reference to the scientific literature. They ask whether the identified research problem is 'Specific', 'Quantifiable', 'Usable', 'Accurate', 'Restricted', 'Eligible', 'Investigable', and 'Timely'. Before formulating the focused clinical question, SQUARE-IT can be used as a preparatory step that enables researchers to organize the relevant information explored to date and to assess whether additional information is needed, thereby identifying current research gaps. It should also facilitate the effectiveness and efficiency of evidence-based practice, helping to ensure high-quality patient care. Using SQUARE-IT as a framework, further elaboration of the approach and the addition of other aspects are warranted to advance the discussion and improve methods of evidence-based practice in medical and therapeutic specialties.
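As a toy illustration only, the eight items could be encoded as a simple checklist. The helper below is hypothetical; the item wording comes from the abstract, but the pass/fail scoring scheme is ours and not part of the published approach:

```python
# The eight SQUARE-IT items, as listed in the abstract.
SQUARE_IT = [
    "Specific", "Quantifiable", "Usable", "Accurate",
    "Restricted", "Eligible", "Investigable", "Timely",
]

def unmet_items(assessment):
    """Return the SQUARE-IT items a draft research problem does not yet satisfy.

    `assessment` maps item name -> bool; missing items count as unmet.
    """
    return [item for item in SQUARE_IT if not assessment.get(item, False)]

# Hypothetical draft assessment: all items satisfied except one.
draft = {item: True for item in SQUARE_IT}
draft["Restricted"] = False  # e.g. the problem is still too broad in scope
print(unmet_items(draft))    # a non-empty list flags remaining gaps
```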

{"title":"SQUARE-IT: a proposed approach to square the identified research problem in the literature with the objectives, the appropriate clinical research question, and the research hypothesis.","authors":"Martin Alfuth, Jonas Klemp, Annette Schmidt, Lukas Streese, Nikolai Ramadanov, Robert Prill","doi":"10.1186/s12874-025-02468-7","DOIUrl":"10.1186/s12874-025-02468-7","url":null,"abstract":"<p><p>The purpose of this article is to design and introduce the SQUARE-IT approach to help scientists and clinicians in research to align important research problems with the objectives, the appropriate clinical research questions to be answered, and the research hypotheses to be investigated in medical and therapeutic specialties. Research ideas can be generated primarily through simple methods such as brainstorming and mind mapping. However, transforming ideas into a valid research question is not as easy as it may seem, as the mere presence of an idea does not guarantee that the researcher has already uncovered existing knowledge on a particular topic or identified the actual research problem. Therefore, the SQUARE-IT items are developed, described, and critically discussed with reference to the scientific literature. They ask whether the identified research problem is 'Specific', 'Quantifiable', 'Usable', 'Accurate', 'Restricted', 'Eligible', 'Investigable', and 'Timely'. Before formulating the focused clinical question, SQUARE-IT can be used as a preparatory step to enable researchers to organize the relevant information that has been explored to date and to assess whether additional information is needed, thereby identifying current research gaps. In addition, it should facilitate the effectiveness and efficiency of evidence-based practice to ensure high quality patient care. 
Using SQUARE-IT as a framework, further elaboration of the approach and addition of other aspects are warranted to advance the discussion and improve methods of evidence-based practice in medical and therapeutic specialties for quality improvement of patient care.</p>","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":"25 1","pages":"19"},"PeriodicalIF":3.9,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11770966/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143051437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Application of GUHA data mining method in cohort data to explore paths associated with premature death: a 29-year follow-up study.
IF 3.9 Tier 3 (Medicine) Q1 HEALTH CARE SCIENCES & SERVICES Pub Date: 2025-01-27 DOI: 10.1186/s12874-025-02477-6
Lily Nosraty, Esko Turunen, Saila Kyrönlahti, Clas-Håkan Nygård, Prakash Kc, Subas Neupane

Background and method: This study set out to identify the factors, and combinations of factors, associated with premature death, using data from the Finnish Longitudinal Study on Ageing Municipal Employees (FLAME), which involved 6,257 participants over a 29-year follow-up period. Exact dates of death were obtained from the Finnish population register. Premature death was defined as a death occurring earlier than the age- and sex-specific actuarial life expectancy indicated by the 1981 life tables used as the baseline, with a threshold period of nine months. Explanatory variables encompassed sociodemographic characteristics, health and functioning, health behaviours, subjective experiences, working conditions, and work ability. Data were mined using the General Unary Hypothesis Automaton (GUHA) method, implemented with the LISp-Miner software. GUHA involves an active dialogue between the user and the software: the mining parameters are not absolute but are tailored to the data and the user's interests.

Results: Over the follow-up period, 2,196 deaths were recorded, of which 70.4% were premature. Seven single factors and 67 sets of criteria (paths) were statistically significantly associated with premature mortality, passing the one-sided Fisher test. Single predicates of premature death included smoking, consuming alcohol a few times a month or once a week, poor self-rated fitness, perceived inability to work, poor anticipated work ability in two years' time, and diseases causing work disability.

Conclusion: The findings indicate that associations between single predictors and premature mortality should be interpreted with caution, even when adjusted for a limited number of other factors. This highlights the complexity of premature mortality and the need for comprehensive models considering multiple interacting factors.
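The screening of a single predicate with a one-sided Fisher test can be sketched as below. The 2x2 counts are invented for illustration and are not FLAME data, and this does not reproduce GUHA's multi-predicate path search:

```python
from scipy.stats import fisher_exact

# Illustrative 2x2 table for one candidate predicate (e.g. current smoking)
# against premature death; counts are made up, not taken from FLAME.
#
#                  premature death   no premature death
# predicate yes          a                   b
# predicate no           c                   d
a, b, c, d = 320, 180, 1226, 800

# One-sided test, as in the GUHA runs: is the predicate positively
# associated with premature death?
odds_ratio, p_value = fisher_exact([[a, b], [c, d]], alternative="greater")
print(f"OR = {odds_ratio:.2f}, one-sided p = {p_value:.4f}")
```

GUHA/LISp-Miner effectively repeats this kind of test over a large space of single and conjunctive predicates, keeping the rules that pass user-set thresholds.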

{"title":"Application of GUHA data mining method in cohort data to explore paths associated with premature death: a 29-year follow-up study.","authors":"Lily Nosraty, Esko Turunen, Saila Kyrönlahti, Clas-Håkan Nygård, Prakash Kc, Subas Neupane","doi":"10.1186/s12874-025-02477-6","DOIUrl":"10.1186/s12874-025-02477-6","url":null,"abstract":"<p><strong>Background and method: </strong>This study set out to identify the factors and combinations of factors associated with the individual's premature death, using data from the Finnish Longitudinal Study on Ageing Municipal Employees (FLAME) which involved 6,257 participants over a 29-year follow-up period. Exact dates of death were obtained from the Finnish population register. Premature death was defined as a death occurring earlier than the age- and sex-specific actuarial life expectancy indicated by life tables for 1981, as the baseline, with the threshold period of nine months. Explanatory variables encompassed sociodemographic characteristics, health and functioning, health behaviors, subjective experiences, working conditions, and work abilities. Data were mined using the General Unary Hypothesis Automaton (GUHA) method, implemented with LISp-Miner software. GUHA involves an active dialogue between the user and the LISp-Miner software, with parameters tailored to the data and user interests. The parameters used are not absolute but depend on the data to be mined and the user's interests.</p><p><strong>Results: </strong>Over the follow-up period, 2,196 deaths were recorded, of which 70.4% were premature. Seven single factors and 67 sets of criteria (paths) were statistically significantly associated with premature mortality, passing the one-sided Fisher test. Single predicates of premature death included smoking, consuming alcohol a few times a month or once a week, poor self-rated fitness, incompetence to work and poor assured workability in two years' time, and diseases causing work disability. 
Notably, most of the factors selected as single predicates of premature mortality did not appear in the multi-predicate paths. Factors appearing in the paths were smoking more than 20 cigarettes a day, symptoms that impaired functioning, past smoking, absence of musculoskeletal diseases, poor self-rated health, having pain, male sex, being married, use of medication, more physical strain compared to others, and high life satisfaction, intention to retire due to reduced work ability caused by diseases and demanding work. Sex-specific analysis revealed similar findings.</p><p><strong>Conclusion: </strong>The findings indicate that associations between single predictors and premature mortality should be interpreted with caution, even when adjusted for a limited number of other factors. This highlights the complexity of premature mortality and the need for comprehensive models considering multiple interacting factors.</p>","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":"25 1","pages":"20"},"PeriodicalIF":3.9,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11771032/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143051529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Construction of the cancer patients' database based on the US National Health and Nutrition Examination Survey (NHANES) datasets for cancer epidemiology research.
IF 3.9 Tier 3 (Medicine) Q1 HEALTH CARE SCIENCES & SERVICES Pub Date: 2025-01-24 DOI: 10.1186/s12874-025-02478-5
Jinyoung Moon, Yongseok Mun

Background: The US National Health and Nutrition Examination Survey (NHANES) dataset does not include a single question or laboratory test that confirms a history of cancer diagnosis. If straightforward variables for cancer history were introduced, however, US NHANES could be used effectively in future cancer epidemiology studies. To address this gap, the authors developed a cancer patient database from the US NHANES datasets using a set of R scripts.

Methods: To illustrate the practical application of this methodology to a real-world problem, the authors extracted the R code applied in an academic paper published in another journal on January 30th, 2024 ( https://doi.org/10.1016/j.heliyon.2024.e24337 ). This paper focuses on the construction of the database and the analysis using that R code.

Results: In the first example, the urine concentration of monocarboxynonyl phthalate, monocarboxyoctyl phthalate, mono-2-ethyl-5-carboxypentyl phthalate, and mono-2-hydroxy-iso-butyl phthalate (all ng/mL) were used as the independent variable, instead of the serum concentration of perfluorooctanoic acid (PFOA), perfluorooctane sulfonic acid (PFOS), perfluorohexane sulfonic acid (PFHxS), and perfluorononanoic acid (PFNA), respectively. In the second example, the serum concentration of 2,3,3',4,4'-Pentachlorobiphenyl (PCB105), 2,3,4,4´,5-Pentachlorobiphenyl (PCB114), 2,3',4,4',5-Pentachlorobiphenyl (PCB118), and 2,2',3,4,4',5'- and 2,3,3',4,4',6-Hexachlorobiphenyl (PCB138) were used as the independent variable, instead of the serum concentration of PFOA, PFOS, PFHxS, and PFNA, respectively.

Discussion: This research offers a comprehensive set of R codes aimed at creating a single, user-friendly variable that encapsulates the history of each type of cancer while also considering the age at which the diagnosis was made. The US NHANES provides a wealth of critical data on environmental toxicant exposures. By employing these R codes, researchers can potentially discover numerous new associations between environmental toxicant exposures and cancer diagnoses. Ultimately, these codes could significantly advance the field of cancer epidemiology in relation to environmental toxicant exposure.
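The paper's pipeline is written in R. As a rough Python sketch of the core derivation, one could flag cancer history per type from the NHANES questionnaire variables MCQ220 ("ever told you had cancer or malignancy") and MCQ230A (kind of cancer). Treat the variable names, and especially the numeric type codes below, as placeholders to be checked against the NHANES documentation rather than as the authors' actual implementation:

```python
import pandas as pd

# Placeholder code map: real MCQ230A type codes must be taken from the
# NHANES survey documentation for the cycle being analysed.
CANCER_CODES = {23: "lung", 14: "breast", 34: "prostate"}

def add_cancer_history(df):
    """Derive one 0/1 indicator column per cancer type from MCQ220/MCQ230A.

    MCQ220 == 1 is assumed to mean 'ever told you had cancer';
    MCQ230A is assumed to carry the first reported cancer type code.
    """
    out = df.copy()
    ever = out["MCQ220"] == 1
    for code, name in CANCER_CODES.items():
        out[f"cancer_{name}"] = (ever & (out["MCQ230A"] == code)).astype(int)
    return out

# Toy rows: one lung cancer, one no-cancer, one breast cancer respondent.
toy = pd.DataFrame({"MCQ220": [1, 2, 1], "MCQ230A": [23, None, 14]})
flagged = add_cancer_history(toy)
```

A full version would also merge the age-at-diagnosis variables and repeat the derivation for the MCQ230B-D follow-up slots, as the published R code reportedly does across survey cycles.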

{"title":"Construction of the cancer patients' database based on the US National Health and Nutrition Examination Survey (NHANES) datasets for cancer epidemiology research.","authors":"Jinyoung Moon, Yongseok Mun","doi":"10.1186/s12874-025-02478-5","DOIUrl":"10.1186/s12874-025-02478-5","url":null,"abstract":"<p><strong>Background: </strong>The US National Health and Nutrition Examination Survey (NHANES) dataset does not include a specific question or laboratory test to confirm a history of cancer diagnosis. However, if straightforward variables for cancer history are introduced, US NHANES could be effectively utilized in future cancer epidemiology studies. To address this gap, the authors developed a cancer patient database from the US NHANES datasets by employing multiple R programming codes.</p><p><strong>Methods: </strong>To illustrate the practical application of this methodology to a real-world problem, the authors extracted the R codes applied in an academic paper published in another journal on January 30th, 2024 ( https://doi.org/10.1016/j.heliyon.2024.e24337 ). This paper will focus on the construction of the database and analysis using R codes. Entire.</p><p><strong>Results: </strong>In the first example, the urine concentration of monocarboxynonyl phthalate, monocarboxyoctyl phthalate, mono-2-ethyl-5-carboxypentyl phthalate, and mono-2-hydroxy-iso-butyl phthalate (all ng/mL) were used as the independent variable, instead of the serum concentration of perfluorooctanoic acid (PFOA), perfluorooctane sulfonic acid (PFOS), perfluorohexane sulfonic acid (PFHxS), and perfluorononanoic acid (PFNA), respectively. 
In the second example, the serum concentration of 2,3,3',4,4'-Pentachlorobiphenyl (PCB105), 2,3,4,4´,5-Pentachlorobiphenyl (PCB114), 2,3',4,4',5-Pentachlorobiphenyl (PCB118), and 2,2',3,4,4',5'- and 2,3,3',4,4',6-Hexachlorobiphenyl (PCB138) were used as the independent variable, instead of the serum concentration of PFOA, PFOS, PFHxS, and PFNA, respectively.</p><p><strong>Discussion: </strong>This research offers a comprehensive set of R codes aimed at creating a single, user-friendly variable that encapsulates the history of each type of cancer while also considering the age at which the diagnosis was made. The US NHANES provides a wealth of critical data on environmental toxicant exposures. By employing these R codes, researchers can potentially discover numerous new associations between environmental toxicant exposures and cancer diagnoses. Ultimately, these codes could significantly advance the field of cancer epidemiology in relation to environmental toxicant exposure.</p>","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":"25 1","pages":"17"},"PeriodicalIF":3.9,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11758729/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143036859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Journal: BMC Medical Research Methodology