Pub Date: 2025-05-20; eCollection Date: 2023-01-01; DOI: 10.23889/ijpds.v8i2.2927
Laura Scott, Yan Weigang, Marcella Ucci, Jessica Sheringham
Background: This project in one urban local authority in London (England) sought to assess the feasibility of generating locally-derived indices of overcrowding using data available to local councils on the population and their homes. We merged data at household level using the Unique Property Reference Number from publicly available Energy Performance Certificates and commercial property platforms, with data available to councils on the population and their housing characteristics, drawn from multiple sources including council tax bands and council housing databases. Multiple imputation was used to address missing data. Using the dataset, it was possible to generate two indices of overcrowding for households with dependent children, based on the bedroom standard and the space standard, which could be compared with nationally derived estimates.
Data challenges: We encountered three challenges with data. 1. Individuals in the population were excluded through linkage with household-level data. 2. Definitions of overcrowding are ambiguous and variably applied. 3. Many local areas face high proportions of missing household data, particularly numbers of bedrooms. We discuss how we addressed such problems and illustrate with a local example how they could affect estimates of overcrowding prevalence.
Lessons learned: Further clarity is needed in how bedrooms are defined to compare overcrowding prevalence generated locally and nationally. Access to national records on bedroom numbers would help local areas to identify overcrowding in their own populations. Despite these challenges, we demonstrate it is feasible to generate overcrowding indices that can be useful for researchers and local policy makers seeking to develop or evaluate strategies to address household overcrowding.
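The linkage and bedroom-standard steps described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes a simplified reading of the bedroom standard (one room per couple, one per other person aged 21 or over, one per same-sex pair aged 10 to 20, one per pair of children under 10), and the field names `uprn`, `age`, `sex`, `couple_id` and `bedrooms` are hypothetical. Households with no recorded bedroom count are left as unknown rather than imputed.

```python
from collections import Counter


def bedrooms_required(members):
    """Bedrooms needed under a SIMPLIFIED bedroom standard (assumption)."""
    rooms = 0
    couples = set()
    adolescents = Counter()  # unpartnered people aged 10-20, counted by sex
    under_10 = Counter()     # children under 10, counted by sex
    for m in members:
        if m.get("couple_id") is not None:
            couples.add(m["couple_id"])
        elif m["age"] >= 21:
            rooms += 1
        elif m["age"] >= 10:
            adolescents[m["sex"]] += 1
        else:
            under_10[m["sex"]] += 1
    rooms += len(couples)
    for sex in list(adolescents):
        n = adolescents[sex]
        rooms += n // 2  # same-sex pairs aged 10-20 share a room
        if n % 2:
            # A leftover adolescent shares with a same-sex child under 10
            # where one exists; either way one more room is needed.
            if under_10[sex] > 0:
                under_10[sex] -= 1
            rooms += 1
    # Remaining children under 10 share in pairs regardless of sex.
    rooms += (sum(under_10.values()) + 1) // 2
    return rooms


def overcrowding_by_uprn(people, properties):
    """Link person records to property records on UPRN.

    Returns {uprn: True/False/None}; None marks a missing bedroom count.
    """
    households = {}
    for person in people:
        households.setdefault(person["uprn"], []).append(person)
    results = {}
    for uprn, members in households.items():
        bedrooms = properties.get(uprn, {}).get("bedrooms")
        if bedrooms is None:
            results[uprn] = None
        else:
            results[uprn] = bedrooms_required(members) > bedrooms
    return results


people = [
    {"uprn": "U1", "age": 38, "sex": "F", "couple_id": 1},
    {"uprn": "U1", "age": 40, "sex": "M", "couple_id": 1},
    {"uprn": "U1", "age": 12, "sex": "F"},
    {"uprn": "U1", "age": 8, "sex": "M"},
    {"uprn": "U1", "age": 5, "sex": "M"},
    {"uprn": "U2", "age": 40, "sex": "F"},
]
properties = {"U1": {"bedrooms": 2}, "U2": {"bedrooms": None}}
print(overcrowding_by_uprn(people, properties))
```

Returning `None` for unknown bedroom counts keeps the missing-data problem visible; the abstract reports that multiple imputation was used at this point instead.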
Data Note: Challenges when combining housing data from multiple sources to identify overcrowded households. International Journal of Population Data Science. DOI: 10.23889/ijpds.v8i2.2927. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12093136/pdf/
Pub Date: 2025-05-13; eCollection Date: 2025-01-01; DOI: 10.23889/ijpds.v10i2.2943
Matthew A Jay, Kate Lewis, Difei Shi, Rebecca Langella, Tony Stone, Sorcha Ní Chobhthaigh, Ania Zylbersztejn, Ruth Blackburn, Katie Harron
Administrative health data, such as the Hospital Episode Statistics (HES), can be used to identify groups of people with a particular target condition, a process known as phenotyping. Clinical phenotypes are useful as exposures, covariates and outcomes in research studies using administrative data, including health data linked to other sources such as the Education and Child Health Insights from Linked Data (ECHILD) project. ECHILD brings together HES and other national health datasets with the National Pupil Database and children's social care data for all of England as a data asset that can be accessed by researchers at UK institutions. Because using linked administrative data is complex, the ECHILD team has created additional resources to improve the accessibility of ECHILD. One such initiative is the ECHILD Phenotype Code List Repository. The Repository is a fully open and searchable website containing phenotype code lists that can be used in ECHILD and beyond. As well as a primer on phenotyping, it includes summaries of each code list and R and Stata implementation scripts. The Repository was designed according to a set of principles to ensure that finding and using code lists is easy and standardised. The ECHILD Phenotype Code List Repository is a step forward in the findability and use of phenotype code lists in ECHILD and its constituent datasets.
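As a rough illustration of what applying a phenotype code list involves, the following Python sketch flags patients whose diagnosis codes match a list of ICD-10 prefixes. It is not taken from the Repository, which the abstract says provides R and Stata scripts. The code list, the record layout and the function names are all assumptions made for the example.

```python
# Illustrative code list only: ICD-10 J45 (asthma) and J46 (status asthmaticus).
ASTHMA_CODES = ("J45", "J46")


def normalise(code):
    """Strip dots and whitespace and upper-case, so 'j45.0' becomes 'J450'."""
    return code.replace(".", "").strip().upper()


def has_phenotype(diagnoses, code_list):
    """True if any diagnosis code starts with a prefix in the code list."""
    return any(
        normalise(code).startswith(prefix)
        for code in diagnoses
        for prefix in code_list
    )


def flag_patients(episodes, code_list):
    """episodes: iterable of (patient_id, [diagnosis codes]) pairs (hypothetical layout)."""
    return {pid for pid, diagnoses in episodes if has_phenotype(diagnoses, code_list)}


episodes = [
    ("p1", ["J45.0", "S52.5"]),
    ("p2", ["K35.8"]),
    ("p3", ["j46"]),
]
print(sorted(flag_patients(episodes, ASTHMA_CODES)))
```

Prefix matching is one common convention for hierarchical codes; a real code list may also specify exact codes, diagnosis positions or age restrictions, which this sketch ignores.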
Open science and phenotyping in UK administrative health, education and social care data: the ECHILD phenotype code list repository. International Journal of Population Data Science. DOI: 10.23889/ijpds.v10i2.2943. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12076273/pdf/
Pub Date: 2025-05-07; eCollection Date: 2025-01-01; DOI: 10.23889/ijpds.v10i1.2703
Max C Keuken, Jizzo R Bosdriesz, Anders Boyd, Elisabeth M den Boogert, Ivo K Joore, Nicole H T M Dukers-Muijrers, Gini van Rijckevorsel, Hannelore M Götz, Irene E Goverse, Mariska W F Petrignani, Stijn F H Raven, Susan van den Hof, Kirsten V C Wevers-de Boer, Maarten F Schim van der Loeff, Amy Matser
Source and contact tracing (SCT) is a core public health measure that is used to contain the spread of infectious diseases. It aims to identify a source of infection, and to advise those who have been exposed to this source. Due to the rapid increases in incidence of COVID-19 in the Netherlands, the capacity to conduct a full SCT quickly became insufficient. Therefore, the public health services (PHS) might benefit from a restricted strategy targeted to geographical regions where (predicted) case-to-case transmission is high. In this study, we set out to develop a prediction model for the number of COVID-19 cases per postal code within the Netherlands using geographic and demographic features. The study population consists of individuals residing in one of the participating nine Dutch PHS regions who tested positive for SARS-CoV-2 between 1 June 2020 and 27 February 2021. Using a machine learning random forest regression model, we predicted the top 100 postal codes with the highest number of cases with an accuracy of 49% for the current week, 42% for next week, and 44% for two weeks from present. In addition, the age groups of 20-39 and 40-64 years had a higher prediction accuracy than groups outside these age ranges. The developed model provides a starting point for targeted preventive SCT efforts that incorporate geospatial and demographic characteristics of a neighbourhood. It should nonetheless be noted that during the early stages of the outbreak, the number of available datapoints needed to inform such models is likely insufficient. Given the accuracy and data requirements of the developed model, it is unlikely that this class of models can play a pivotal role in informing policy during the early phases of a future epidemic.
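The abstract does not define how the top-100 accuracy was calculated. One plausible reading, shown here as a Python sketch under that assumption, is the share of the predicted top-k postal codes that also appear in the observed top-k. The postal codes and counts below are invented toy values, and the function names are hypothetical.

```python
def top_k(counts, k):
    """Return the k postal codes with the highest counts (ties broken by code)."""
    ranked = sorted(counts, key=lambda code: (-counts[code], code))
    return set(ranked[:k])


def top_k_accuracy(predicted, observed, k):
    """Fraction of the predicted top-k that is also in the observed top-k."""
    return len(top_k(predicted, k) & top_k(observed, k)) / k


# Toy weekly case counts per postal code (not real data).
predicted = {"1011": 9, "1012": 7, "1013": 1, "1014": 5}
observed = {"1011": 8, "1012": 2, "1013": 6, "1014": 4}
print(top_k_accuracy(predicted, observed, k=2))
```

Under this definition a score of 49% with k = 100 would mean 49 of the 100 predicted postal codes were among the 100 with the most observed cases.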
Spatio-temporal forecasting of COVID-19 cases in the Netherlands for source and contact tracing. International Journal of Population Data Science. DOI: 10.23889/ijpds.v10i1.2703. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12058245/pdf/
Pub Date: 2025-04-30; eCollection Date: 2025-01-01; DOI: 10.23889/ijpds.v10i1.2461
Kate M Miller, Felicity S Flack, Merran B Smith, Vicki Bennett, Carina Ecremen Marshall
Background: Metadata plays a crucial role in the health research infrastructure ecosystem. Despite the abundance of metadata for data collections in Australia, the vast and diverse data custodian landscape poses challenges for linked data researchers to find relevant information for multiple data collections, often making it an arduous and time-intensive task.
Methods: The project comprised three phases: an initial scoping exercise to understand the current state of metadata and related best practice; a national consultation involving researchers, data linkage staff and data custodians to develop a high-fidelity prototype of a metadata platform; and a final build and implementation phase. The platform underwent several prototyping and testing cycles to refine the digital experience.
Results: Expert interviews confirmed that there is a wealth of metadata available, but it is difficult for researchers to access and evaluate. Consultations with researchers identified opportunities to standardise metadata across collections and provide a centralised platform to enhance the discoverability of data collections for research using linked data. High value platform features included searching, browsing and filtering capabilities, data item list metadata, standardised formats, sample data, and frequently asked questions. The final design and functionality reflected user consultations and data custodian input on feasibility.
Conclusion: The Population Health Research Network developed a metadata platform to enable researchers to evaluate the suitability of Australian data collections for linked data projects more effectively. The platform has standardised the way in which metadata is presented for data collections nationally. Improved metadata quality, readability and accessibility will save time and enhance the quality of applications for linked data.
Discovering linked data collections through a new national metadata platform. International Journal of Population Data Science. DOI: 10.23889/ijpds.v10i1.2461. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12042732/pdf/
Pub Date: 2025-04-10; eCollection Date: 2025-01-01; DOI: 10.23889/ijpds.v10i2.2945
Amelia Jewell, Matthew Broadbent, Claire Delaney-Pope, Megan Pritchard, Hannah Woods, Robert Stewart
Background: Transparency in the use of routinely collected mental health data for research is essential in maintaining public support and trust, as well as for supporting the sharing of information and data resources amongst the academic community. The National Institute for Health and Care Research (NIHR) Maudsley Biomedical Research Centre (BRC) Clinical Records Interactive Search (CRIS) enables a case register of deidentified mental health records from the South London and Maudsley NHS Foundation Trust (SLaM). CRIS supports mental health research across the lifespan from children and adolescents to older adults.
Aim: This paper aims to describe the activities which contribute to ensuring that transparency is maintained throughout the journey of data in CRIS: from data collection, through application in research, to dissemination of findings.
Approach: A communications plan is in place to support Patient and Public Involvement (PPI) and transparency initiatives for all CRIS stakeholders, including patients and carers, academic users, and the general public. Activities can be divided into three categories of transparency: existence, use, and output.
Discussion: There are challenges to maintaining transparency, including ensuring that activities are varied enough to reach all stakeholders, including harder to reach groups, and presenting information in a way that is appropriate for the relevant audience. However, greater transparency has led to more opportunities for researchers to engage with patients and the CRIS model is widely accepted by patients.
Conclusion: This paper set out to describe CRIS communications and transparency activities. We believe the material covered will be of interest to other providers of routinely collected data for research.
Transparency in the existence, use, and output of a mental health data resource: a descriptive paper from the National Institute for Health and Care Research (NIHR) Maudsley Biomedical Research Centre (BRC) Clinical Record Interactive Search (CRIS) Platform. International Journal of Population Data Science. DOI: 10.23889/ijpds.v10i2.2945. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12076277/pdf/
Pub Date: 2025-03-25; eCollection Date: 2023-01-01; DOI: 10.23889/ijpds.v8i6.2953
Shivani Sickotra
Introduction: Sequence analysis is a powerful methodology for examining longitudinal school-to-work trajectories. Despite its growing use, there is limited guidance on preparing suitable datasets. This resource details the creation of a dataset specifically designed for sequence analysis, capturing yearly education and employment activity states for 556,182 individuals from England's 2010/11 school-leaver cohort.
Methods: The dataset was constructed using the Department for Education's Longitudinal Education Outcomes (LEO) data. SQL was used to extract relevant variables, and data linkage and preprocessing were performed using R. Data processing was tailored to sequence analysis, including reducing the number of activity states and applying a hierarchy to integrate education and employment data.
Results: The resulting dataset spans activities from the first non-compulsory state in 2011/12 until 2018/19, tracking trajectories from ages 16/17 to 23/24. The dataset was designed with the ability to subset school-leavers by their initial Combined Authority residence to aid in regional analysis of school-to-work trajectories. Individual-level socio-demographic characteristics that can be linked to the longitudinal activity histories were also built, alongside longitudinal geographic locations and employment earnings data. Additionally, the limitations of the developed data are discussed.
Conclusion: This resource provides crucial guidance for researchers and practitioners who may lack experience preparing input datasets for sequence analysis, addressing the current gap in available resources. By offering step-by-step instructions and shared code, it empowers users to recreate or adapt the dataset for their specific research needs. Its ability to subset by region further supports localised and comparative studies of school-to-work trajectories, making it a valuable tool for advancing existing research. The LEO data can be accessed by application through the Office for National Statistics Secure Research Service.
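The idea of applying a hierarchy to reduce several recorded activities to one state per year can be sketched as follows. This is an illustration only: the paper's processing was done in SQL and R, and the state names, their priority order and the record layout here are hypothetical rather than taken from the LEO data.

```python
# Hypothetical priority order: the first state present in a year wins.
PRIORITY = [
    "higher_education",
    "further_education",
    "employment",
    "benefits",
]


def yearly_state(activities):
    """Collapse the set of activities recorded in one year to a single state."""
    for state in PRIORITY:
        if state in activities:
            return state
    return "unknown"  # nothing recorded for that year


def build_sequence(records, years):
    """records: {academic year: set of activities}. Returns one state per year."""
    return [yearly_state(records.get(year, set())) for year in years]


years = ["2011/12", "2012/13", "2013/14", "2014/15"]
records = {
    "2011/12": {"further_education", "employment"},
    "2012/13": {"further_education"},
    "2013/14": {"employment", "benefits"},
}
print(build_sequence(records, years))
```

A fixed-length list with exactly one state per year is the shape that sequence analysis tools generally expect as input, which is why overlapping activities need a tie-breaking rule of this kind.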
Data resource profile: a guide for constructing school-to-work sequence analysis trajectories using the longitudinal education outcomes (LEO) data. International Journal of Population Data Science. DOI: 10.23889/ijpds.v8i6.2953. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11935648/pdf/
Pub Date: 2025-03-18; eCollection Date: 2025-01-01; DOI: 10.23889/ijpds.v10i1.2413
Caitlin Gray, Helen Leonard, Matthew N Cooper, Dheeraj Rai, Emma J Glasson
Introduction: Siblings of children with neurodevelopmental conditions have unique experiences and challenges related to their sibling role. Some develop mental health concerns as measured by self-reported surveys or parent report. Few data are available at the population level, owing to difficulties capturing wide-scale health data for siblings. Data linkage is a technique that can facilitate such research.
Objective: To explore the application of population data linkage as a research method to capture health outcomes of siblings of children with neurodevelopmental conditions.
Inclusion criteria: Peer reviewed papers that captured health outcomes for siblings of children and young adults with neurodevelopmental conditions using population data linkage.
Methods: JBI scoping review methods were followed. Papers were searched within CINAHL, Ovid, Scopus, and Web of Science from 2000 to 2024 using search terms relating to 'data linkage', 'neurodevelopmental conditions', 'siblings' and 'health outcomes'.
Results: The final data extraction included 31 papers. The neurodevelopmental conditions of index children were autism, attention deficit hyperactivity disorder, intellectual disability, cerebral palsy and developmental delay. The mean follow-up time was 31 years, and the majority of studies originated from Scandinavia. Sibling health outcomes observed were psychiatric diagnoses, self-harm and suicide, other neurodevelopmental conditions, and medical conditions such as atopic disease, cancer and obesity.
Conclusion: Data linkage can help capture sibling health outcomes quickly across large cohorts with a range of neurodevelopmental conditions. Future research could be enhanced by focusing on siblings as the primary group of interest, increased integration of genealogical data, and comparisons between diagnostic groups and severity levels. Adoption of established rigorous reporting methods will increase the replicability of this type of research, and provide a stronger evidence-base from which to inform sibling supports.
The application of population data linkage to capture sibling health outcomes among children and young adults with neurodevelopmental conditions. A scoping review. International Journal of Population Data Science. DOI: 10.23889/ijpds.v10i1.2413. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11923734/pdf/
Pub Date: 2025-03-11; eCollection Date: 2023-01-01; DOI: 10.23889/ijpds.v8i5.2935
Joseph Lam, Mario Cortina-Borja, Robert Aldridge, Ruth Blackburn, Katie Harron
Accurate data linkage across large administrative databases is crucial for addressing complex research and policy questions, yet linkage errors stemming from inconsistent name representations can introduce biases, predominantly for names not given in English. This data note examines the impact of romanisation on linkage accuracy, focusing on Chinese names and comparing standardised systems (Jyutping and Pinyin) with the non-standardised Hong Kong Government Cantonese Romanisation (HKG-romanisation). We identify three primary issues: language-specific variations in romanisation, the loss of tonal information inherent to tonal languages, and discrepancies in name order conventions. Using a dataset of 771 Hong Kong student names, our analysis reveals that standardised romanisation systems enhance the uniqueness and consistency of name representations, thereby improving linkage precision and recall compared to HKG-romanisation. Specifically, Jyutping and Pinyin achieved over 95% recall in blocking strategies, whereas HKG-romanisation reached only 68.8%. Incorporating tonal information further improved recall. These findings underscore the necessity of adopting standardised, tone-sensitive romanisation systems and flexible database designs to reduce linkage errors and promote data equity for under-represented groups. We advocate for the implementation of phonetic encodings in databases, alongside language-specific pre-processing protocols, to ensure more inclusive and accurate data linkage processes.
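To make the notion of blocking recall concrete, the following is a minimal sketch of how it could be measured. It is not the authors' code: the record layout, the `blocking_recall` helper, and all eight toy records are hypothetical, and the spellings and tone-numbered codes are illustrative values that have not been checked against a romanisation dictionary or the paper's dataset. Recall here is the share of true-match pairs whose two records fall into the same block.

```python
from collections import defaultdict
from itertools import combinations


def blocking_recall(records, key_fn):
    """Share of true-match record pairs that share a blocking key."""
    # Ground truth: two records match when they refer to the same person.
    true_pairs = {
        frozenset((a["rid"], b["rid"]))
        for a, b in combinations(records, 2)
        if a["person"] == b["person"]
    }
    # Blocking: only records with an identical key are ever compared.
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec["rid"])
    candidate_pairs = {
        frozenset(pair)
        for ids in blocks.values()
        for pair in combinations(ids, 2)
    }
    if not true_pairs:
        return 0.0
    return len(true_pairs & candidate_pairs) / len(true_pairs)


# Hypothetical toy data: each person appears once in file A and once in file B.
# "spelling" stands in for a free, non-standardised romanisation;
# "code" stands in for a standardised, tone-numbered encoding.
records = [
    {"rid": "A1", "person": 1, "spelling": "Cheung", "code": "zoeng1"},
    {"rid": "B1", "person": 1, "spelling": "Cheong", "code": "zoeng1"},
    {"rid": "A2", "person": 2, "spelling": "Chan", "code": "can4"},
    {"rid": "B2", "person": 2, "spelling": "Chan", "code": "can4"},
    {"rid": "A3", "person": 3, "spelling": "Wong", "code": "wong4"},
    {"rid": "B3", "person": 3, "spelling": "Wong", "code": "wong4"},
    {"rid": "A4", "person": 4, "spelling": "Tsang", "code": "zang1"},
    {"rid": "B4", "person": 4, "spelling": "Tseng", "code": "zang1"},
]

recall_spelling = blocking_recall(records, lambda r: r["spelling"])
recall_code = blocking_recall(records, lambda r: r["code"])
print(recall_spelling, recall_code)
```

In this toy setup, two of the four people are spelled differently across the two files, so blocking on the free spelling misses half of the true pairs, while blocking on the standardised code keeps all of them. The figures reported in the abstract come from the authors' own data and methods, not from this sketch.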
{"title":"Data Note: Alternative Name Encodings - Using Jyutping or Pinyin as tonal representations of Chinese names for data linkage.","authors":"Joseph Lam, Mario Cortina-Borja, Robert Aldridge, Ruth Blackburn, Katie Harron","doi":"10.23889/ijpds.v8i5.2935","DOIUrl":"10.23889/ijpds.v8i5.2935","url":null,"abstract":"<p><p>Accurate data linkage across large administrative databases is crucial for addressing complex research and policy questions, yet linkage errors-stemming from inconsistent name representations-can introduce biases, predominantly for names not given in English. This data note examines the impact of romanisation on linkage accuracy, focusing on Chinese names and comparing standardised systems (Jyutping and Pinyin) with the non-standardised Hong Kong Government Cantonese Romanisation (HKG-romanisation). We identify three primary issues: language-specific variations in romanisation, the loss of tonal information inherent to tonal languages, and discrepancies in name order conventions. Using a dataset of 771 Hong Kong student names, our analysis reveals that standardised romanisation systems enhance the uniqueness and consistency of name representations, thereby improving linkage precision and recall compared to HKG-romanisation. Specifically, Jyutping and Pinyin achieved over 95% recall in blocking strategies, whereas HKG-romanisation only reached 68.8%. Incorporating tonal information further improved recall. These findings underscore the necessity of adopting standardised, tone-sensitive romanisation systems and flexible database designs to reduce linkage errors and promote data equity for under-represented groups. 
We advocate for the implementation of phonetic encodings in databases, alongside language-specific pre-processing protocols, to ensure more inclusive and accurate data linkage processes.</p>","PeriodicalId":36483,"journal":{"name":"International Journal of Population Data Science","volume":"8 5","pages":"2935"},"PeriodicalIF":1.6,"publicationDate":"2025-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11897931/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143616678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-06; eCollection Date: 2025-01-01; DOI: 10.23889/ijpds.v10i1.2496
Michael O Budu, Katherine W Kooij, Kate Heath, Taylor McLinden, Claudette Cardinal, Scott D Emerson, Paul Sereda, Jason Trigg, Jenny Li, Erin Ding, Mark W Hull, Kate Salters, Viviane D Lima, Rolando Barrios, Julio S G Montaner, Robert S Hogg
Introduction: The Comparative Outcomes and Service Utilization Trends (COAST) study compares health outcomes among People With HIV (PWH) and People Without HIV (PWoH) in British Columbia (BC), Canada. The cohort was recently updated to include persons diagnosed with HIV after March 31, 2013, and expanded to broaden research applications.
Methods: COAST includes PWH and a 10% random sample of the general population without HIV, all aged ≥19. Our study links an HIV registry to healthcare practitioner billing, hospital and emergency department attendance data, prescription drug dispensations, and a cancer registry. Our cohort update included new sampling strategies, adding data on emergency department visits not previously captured, and extending our follow-up period to 28 years (from 1992 to 2020). COAST now includes 17,119 PWH and 615,264 PWoH.
Findings to date: COAST has contributed to our understanding of combination antiretroviral therapy (ART) use, health service utilization, chronic diseases, mental health and substance use disorders, and mortality among PWH in BC. Key findings include earlier age at diagnosis of certain chronic conditions, a higher incidence of mood disorders among PWH, and noteworthy shifts in causes of death among PWH on ART. The updated cohort will provide insights into the changing nature of the population living with HIV in BC and serve as a novel foundation for further research.
Future plans: To explore and extend knowledge of the evolving trends among people living and aging with HIV in BC, regular data linkage updates and the inclusion of additional datasets are scheduled every two years.
{"title":"Cohort Profile Update: Reflecting back and looking ahead: Updating the Comparative Outcomes and Service Utilization Trends (COAST) Study to include 28 years of linked data from people with and without HIV in British Columbia, Canada.","authors":"Michael O Budu, Katherine W Kooij, Kate Heath, Taylor McLinden, Claudette Cardinal, Scott D Emerson, Paul Sereda, Jason Trigg, Jenny Li, Erin Ding, Mark W Hull, Kate Salters, Viviane D Lima, Rolando Barrios, Julio S G Montaner, Robert S Hogg","doi":"10.23889/ijpds.v10i1.2496","DOIUrl":"10.23889/ijpds.v10i1.2496","url":null,"abstract":"<p><strong>Introduction: </strong>The Comparative Outcomes and Service Utilization Trends (COAST) study compares health outcomes among People With HIV (PWH) and People Without HIV (PWoH) in British Columbia (BC), Canada. The cohort was recently updated to include persons diagnosed with HIV after March 31, 2013, and expanded to broaden research applications.</p><p><strong>Methods: </strong>COAST includes PWH and a 10% random sample of the general population without HIV, all aged ≥19. Our study links an HIV registry to healthcare practitioner billing, hospital and emergency department attendance data, prescription drug dispensations, and a cancer registry. Our cohort update included new sampling strategies, adding data on emergency department visits not previously captured, and extending our follow-up period to 28 years (from 1992 to 2020). COAST now includes 17,119 PWH and 615,264 PWoH.</p><p><strong>Findings to date: </strong>COAST has contributed to our understanding of combination antiretroviral therapy (ART) use, health service utilization, chronic diseases, mental health and substance use disorders, and mortality among PWH in BC. Key findings include earlier age at diagnosis of certain chronic conditions, a higher incidence of mood disorders among PWH, and noteworthy shifts in causes of death among PWH on ART. 
The updated cohort will provide insights into the changing nature of the population living with HIV in BC and serves as a novel foundation for further research.</p><p><strong>Future plans: </strong>To explore and extend knowledge of the evolving trends among people living and aging with HIV in BC, regular data linkage updates and the inclusion of additional datasets are scheduled every two years.</p>","PeriodicalId":36483,"journal":{"name":"International Journal of Population Data Science","volume":"10 1","pages":"2496"},"PeriodicalIF":2.2,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11922098/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143665089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-03; eCollection Date: 2025-01-01; DOI: 10.23889/ijpds.v10i1.2467
Claudia Medina Coeli, Rosa Maria Soares Madeira Domingues, Lana Meijinhos, Daniela Medina Coeli Bastos, Rejane Sobrino Pinheiro, Valeria Saraceni, Marcos Augusto Bastos Dias, Natália Santana Paiva, Kenneth Rochel de Camargo
Introduction: The absence of a unique patient identifier in the Brazilian hospital administrative database prevents the identification of hospital episodes with multiple hospitalisations of the same patient.
Objectives: This study aims to evaluate the information gained by using a computer routine to identify acute obstetric hospital episodes and its impact on assessing markers of case severity.
Methods: The data source was a de-identified Brazilian hospital administrative database from 2017 to 2020, including hospitalisation records of women of reproductive age (10 to 49 years old) treated for acute conditions (N = 16,087,490). We processed this database by combining C++ and Python routines to create a hospital episodes database. From the latter, we selected obstetric hospital episodes from 2018 to 2019 (N = 4,926,877). We compared selected characteristics of the hospital episodes according to their type (multiple vs single records per episode), testing for differences using effect size measures. We compared relative differences in case severity markers when using the hospital episode as the unit of analysis with those obtained from isolated hospitalisations (N = 5,018,350).
Results: Compared to single-record episodes, multiple-record episodes had a longer length of stay, higher reimbursed amounts, and a lower proportion of patients discharged alive. When comparing the analysis of isolated hospitalisations with the analysis of hospital episodes, we observed an increase in all case severity indicators, especially hospital deaths, with an increment of 13.15%. The computer routine decreased the share of hospital admissions whose reason for discharge did not indicate the outcome (continued hospital stay or inter-hospital transfer) from 2.29% to 0.73%.
Conclusions: The deterministic matching computer routine proved valuable for identifying records that refer to the same hospital episode, which improved the assessment of severe cases.
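The idea of collapsing several hospitalisation records into one episode can be sketched as follows. This is an illustration under stated assumptions, not the authors' C++ and Python routine: the abstract does not describe the matching rule, so the `patient_key` field (a stand-in for whatever combination of de-identified attributes is matched exactly), the one-day gap rule, and the four toy records are all hypothetical. A small helper also shows the relative-change arithmetic applied to the two shares reported in the Results (2.29 and 0.73, read as percentages).

```python
from datetime import date, timedelta


def build_episodes(records, max_gap_days=1):
    """Group records into episodes: same key, next admission within the gap.

    Assumed rule (not from the paper): a record joins the open episode of
    its patient_key when it is admitted no later than max_gap_days after
    that episode's latest discharge; overlapping stays are merged too.
    """
    episodes = []
    open_episode = {}  # patient_key -> most recent episode for that key
    ordered = sorted(records, key=lambda r: (r["patient_key"], r["admitted"]))
    for rec in ordered:
        ep = open_episode.get(rec["patient_key"])
        gap_ok = (
            ep is not None
            and rec["admitted"] - ep["discharged"] <= timedelta(days=max_gap_days)
        )
        if gap_ok:
            ep["records"].append(rec["rid"])
            ep["discharged"] = max(ep["discharged"], rec["discharged"])
        else:
            ep = {
                "patient_key": rec["patient_key"],
                "admitted": rec["admitted"],
                "discharged": rec["discharged"],
                "records": [rec["rid"]],
            }
            episodes.append(ep)
            open_episode[rec["patient_key"]] = ep
    return episodes


def relative_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


# Hypothetical toy records; keys and dates are invented for illustration.
records = [
    {"rid": "r1", "patient_key": "K1", "admitted": date(2019, 1, 1), "discharged": date(2019, 1, 3)},
    {"rid": "r2", "patient_key": "K1", "admitted": date(2019, 1, 3), "discharged": date(2019, 1, 10)},
    {"rid": "r3", "patient_key": "K1", "admitted": date(2019, 3, 1), "discharged": date(2019, 3, 2)},
    {"rid": "r4", "patient_key": "K2", "admitted": date(2019, 1, 5), "discharged": date(2019, 1, 6)},
]

episodes = build_episodes(records)
print(len(records), "records ->", len(episodes), "episodes")

# Share of admissions with an uninformative discharge reason, before and after.
change = relative_change(2.29, 0.73)
print(round(change, 1))
```

In the toy data, `r1` and `r2` form a single episode because the second admission starts on the day of the first discharge, while `r3` opens a new episode for the same key almost two months later. Applied to the reported shares, the helper gives a relative decrease of roughly 68%.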
{"title":"Using a deterministic matching computer routine to identify hospital episodes in a Brazilian de-identified administrative database for the analysis of obstetrics hospitalisations.","authors":"Claudia Medina Coeli, Rosa Maria Soares Madeira Domingues, Lana Meijinhos, Daniela Medina Coeli Bastos, Rejane Sobrino Pinheiro, Valeria Saraceni, Marcos Augusto Bastos Dias, Natália Santana Paiva, Kenneth Rochel de Camargo","doi":"10.23889/ijpds.v10i1.2467","DOIUrl":"10.23889/ijpds.v10i1.2467","url":null,"abstract":"<p><strong>Introduction: </strong>The absence of a unique patient identifier in the Brazilian hospital administrative database prevents the identification of hospital episodes with multiple hospitalisations of the same patient.</p><p><strong>Objectives: </strong>This study aims to evaluate the information gain by using a computer routine to identify acute Obstetrics hospital episodes and its impact on assessing marks of case severity.</p><p><strong>Methods: </strong>The data source was a de-identified Brazilian hospital administrative database from 2017 to 2020, including hospitalisations records of women of reproductive age (10 to 49 years old) for treating acute conditions (N=16,087,490). We processed this database by combining C++ and Python routines to create a hospital episodes database. From the latter, we selected obstetrics hospital episodes from 2018 to 2019 (N = 4,926,877). We compared selected characteristics of the hospital episodes according to their type (multiple vs single records per episode), testing for differences using effect size measures. We compared relative differences in case severity marks when using the hospital episode as the unit of analysis to that of isolated hospitalisations (N = 5,018,350).</p><p><strong>Results: </strong>Compared to single-record episodes, multiple-records episodes had longer length of stay, higher amount reimbursed, and lower proportion of discharge alive. 
When comparing isolated hospitalisations to hospital episodes analysis, we observed an increase in all case severity indicators, especially for hospital deaths, with an increment of 13.15%. The computer routine decreased the hospital admissions with a reason for hospital discharge that did not indicate the outcome (hospital stay or inter-hospital transfer) from 2.29% to 0.73.</p><p><strong>Conclusions: </strong>The deterministic matching computer routine proved valuable for identifying records that refer to the same hospital episode, which improved the assessment of severe cases.</p>","PeriodicalId":36483,"journal":{"name":"International Journal of Population Data Science","volume":"10 1","pages":"2467"},"PeriodicalIF":1.6,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11874899/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143558254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}