Introduction: This paper presents a Four Question Framework to guide data integration partners in building a strong governance and legal foundation to support ethical data use.
Objectives: While this framework was developed based on work in the United States that routinely integrates public data, it is meant to be a simple, digestible tool that can be adapted to any context.
Methods: The framework was developed through a series of public deliberation workgroups and 15 years of field experience working with a diversity of data integration efforts across the United States.
Results: The Four Questions are: Is this legal? Is this ethical? Is this a good idea? How do we know (and who decides)? These questions should be considered within an established data governance framework and alongside core partners to determine whether and how to move forward when building an Integrated Data System (IDS), and again at each stage of a specific data project. We discuss these questions in depth, with a particular focus on the role of governance in establishing legal and ethical data use. In addition, we provide example data governance structures from two IDS sites and hypothetical scenarios that illustrate key considerations for the Four Question Framework.
Conclusions: A robust governance process is essential for determining whether data sharing and integration is legal, ethical, and a good idea within the local context. This process is iterative and as relational as it is technical, which means authentic collaboration across partners should be prioritized at each stage of a data use project. The Four Questions serve as a guide for determining whether to undertake data sharing and integration and should be regularly revisited throughout the life of a project.
Highlights: Strong data governance has five qualities: it is purpose-, value-, and principle-driven; strategically located; collaborative; iterative; and transparent. Through a series of public deliberation workgroups and 15 years of field experience, we developed a Four Question Framework to determine whether and how to move forward with building an IDS and at each stage of a data sharing and integration project. The Four Questions (Is this legal? Is this ethical? Is this a good idea? How do we know, and who decides?) should be carefully considered within established data governance processes and among core partners.
Introduction: "Big data" - including linked administrative data - can be exploited to evaluate interventions for maternal and child health, providing time- and cost-effective alternatives to randomised controlled trials. However, using these data to evaluate population-level interventions can be challenging.
Objectives: We aimed to inform future evaluations of complex interventions by describing sources of bias, lessons learned, and suggestions for improvements, based on two observational studies using linked administrative data from health, education and social care sectors to evaluate the Family Nurse Partnership (FNP) in England and Scotland.
Methods: We first considered how different sources of potential bias within the administrative data could affect results of the evaluations. We explored how each study design addressed these sources of bias using maternal confounders captured in the data. We then determined what additional information could be captured at each step of the complex intervention to enable analysts to minimise bias and maximise comparability between intervention and usual care groups, so that any observed differences can be attributed to the intervention.
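The evaluations' actual adjustment strategies are not detailed in this abstract; purely as a hedged sketch of one common way maternal confounders captured in linked data can be used to improve comparability between intervention and usual-care groups, the snippet below applies inverse-probability-of-treatment weighting with scikit-learn. The file name, column names, and choice of confounders are all hypothetical, not the studies' specifications.

```python
# Hypothetical sketch: inverse-probability-of-treatment weighting (IPTW)
# comparing FNP-enrolled mothers with usual care using measured maternal confounders.
# All column names (maternal_age, deprivation_quintile, ...) are illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("linked_cohort.csv")  # hypothetical linked extract

confounders = ["maternal_age", "deprivation_quintile", "parity", "prior_contacts"]
X, treated = df[confounders], df["fnp_enrolled"]

# Estimate each mother's propensity of enrolment from the measured confounders.
ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

# Stabilised weights: treated records weighted by p(treated)/ps, controls by p(control)/(1 - ps).
p_treat = treated.mean()
df["weight"] = treated * (p_treat / ps) + (1 - treated) * ((1 - p_treat) / (1 - ps))

# Weighted comparison of an outcome between intervention and usual care.
for group, g in df.groupby("fnp_enrolled"):
    print(group, (g["outcome"] * g["weight"]).sum() / g["weight"].sum())
```

Weighting is shown only because it makes the role of measured confounders explicit; it cannot remove bias from characteristics that are not captured in the linked data, which is the limitation the lessons below address.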
Results: Lessons learned include the need for i) detailed data on intervention activity (dates/geography) and usual care; ii) improved information on data linkage quality to accurately characterise control groups; iii) more efficient provision of linked data to ensure timeliness of results; iv) better measurement of confounding characteristics affecting who is eligible, approached and enrolled.
Conclusions: Linked administrative data are a valuable resource for evaluations of the FNP national programme and other complex population-level interventions. However, information on local programme delivery and usual care is required to account for biases that characterise those who receive the intervention, and to inform understanding of mechanisms of effect. National, ongoing, robust evaluations of complex public health interventions would be more achievable if programme implementation were integrated with improved national and local data collection and robust quasi-experimental designs.
Introduction: Research to date has established that the COVID-19 pandemic has not impacted everyone equitably. Whether this inequitable impact was also seen educationally, with regard to educator-reported barriers to distance learning, concerns, and mental health, is less clear.
Objective: The objective of this study was to explore the association between the neighbourhood composition of the school and kindergarten educator-reported barriers and concerns regarding children's learning during the first wave of COVID-19 related school closures in Ontario, Canada.
Methods: In the spring of 2020, we collected data from Ontario kindergarten educators (n = 2569; 74.2% kindergarten teachers, 25.8% early childhood educators; 97.6% female) using an online survey about their experiences and challenges with online learning during the first round of school closures. We linked the educator responses to 2016 Canadian Census variables based on schools' postal codes. Bivariate correlations and Poisson regression analyses were used to determine whether neighbourhood composition was associated with educator mental health and with the number of barriers and concerns reported by kindergarten educators.
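The abstract does not report the exact model specification; as an illustrative sketch only, the snippet below shows how a Poisson regression of the count of educator-reported barriers on school-neighbourhood Census characteristics might be fit with statsmodels. The file name and all variable names are hypothetical.

```python
# Hypothetical sketch: Poisson regression of the number of educator-reported
# barriers on school-neighbourhood Census characteristics (names illustrative).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Educator survey responses linked to 2016 Census variables via school postal code.
df = pd.read_csv("educator_census_linked.csv")  # hypothetical linked file

model = smf.poisson(
    "n_barriers ~ median_income + prop_lone_parent + avg_household_size"
    " + prop_no_official_language + prop_recent_immigrants + prop_age_0_4",
    data=df,
).fit()

print(model.summary())
print(np.exp(model.params).round(3))  # exponentiated coefficients as rate ratios
```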
Results: There were no significant associations between educator mental health and school neighbourhood characteristics. Educators who taught at schools in neighbourhoods with lower median income reported a greater number of barriers to online learning (e.g., parents/guardians not submitting assignments or providing updates on their child's learning) and more concerns regarding the return to school in the fall of 2020 (e.g., students' readjustment to routines). There were no significant associations between educator-reported barriers or concerns and any of the other Census neighbourhood variables (proportion of lone-parent families, average household size, proportion of the population that does not speak an official language, proportion of the population who are recent immigrants, or proportion of the population aged 0-4).
Conclusions: Overall, our study suggests that the neighbourhood composition of the children's school location did not exacerbate the potential negative learning experiences of kindergarten students and educators during the COVID-19 pandemic, although educators teaching in schools in lower-SES neighbourhoods did report more barriers to online learning during this time. Taken together, our findings suggest that remediation efforts should be focused on individual kindergarten children and their families rather than on school location.
Introduction: Population-wide educational attainment registers are necessary for educational planning and research. Regular linking of databases is needed to build and update such a register. Without the availability of unique national identification numbers, record linkage must be based on quasi-identifiers such as name, date of birth, and sex. However, the data protection principle of data minimization aims to minimize the set of identifiers in databases.
Objectives: Therefore, the German Federal Ministry of Education and Research commissioned a study to inform legislation on the minimum set of identifiers required for a national educational register.
Methods: To justify our recommendations empirically, we implemented a microsimulation of about 20 million people. The simulated register accumulates changes and errors in identifiers due to migration, regional mobility, marriage, school career, and mortality, thereby allowing the study of such errors in longitudinal datasets. Updated records were linked yearly to the simulated register using several linkage methods; clear-text methods as well as privacy-preserving record linkage (PPRL) methods were compared.
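The abstract does not specify how the matchkeys were constructed; purely as an illustrative sketch, the snippet below shows one common privacy-preserving pattern in which several hashed matchkeys are derived from different combinations of normalised quasi-identifiers, so that a record can still be linked when a single field (for example, a surname after marriage) has changed. The field choices, normalisation rules, and use of SHA-256 are assumptions, not the study's specification.

```python
# Hypothetical sketch: multiple hashed matchkeys built from quasi-identifiers.
# A record pair is treated as a candidate match if any matchkey agrees.
import hashlib
import unicodedata

def normalise(value: str) -> str:
    """Lowercase, strip accents and whitespace so small variations do not break hashing."""
    text = unicodedata.normalize("NFKD", value or "").encode("ascii", "ignore").decode()
    return "".join(text.lower().split())

def matchkeys(first, last, dob, sex, birthplace):
    """Return several hashes over different identifier combinations."""
    combos = [
        (first, last, dob, sex),          # full key
        (first, dob, sex, birthplace),    # tolerates a surname change (e.g., marriage)
        (last, dob, sex, birthplace),     # tolerates first-name errors
    ]
    return {
        hashlib.sha256("|".join(normalise(x) for x in combo).encode()).hexdigest()
        for combo in combos
    }

a = matchkeys("Anna", "Meyer", "1990-04-01", "F", "Bremen")
b = matchkeys("Anna", "Schmidt", "1990-04-01", "F", "Bremen")  # surname changed
print(bool(a & b))  # True: the second matchkey still agrees
```

In a production PPRL setting, a keyed hash (e.g., HMAC with a secret key held by the linkage unit) and error-tolerant encodings would typically replace the plain SHA-256 shown here for brevity.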
Results: The results indicate linkage bias if only the primary identifiers are available in the register. More detailed identifiers, including place of birth, are required to minimize linkage bias. The amount of information available to identify a person for matching is more critical for linkage quality than the record linkage method applied. Differences in linkage quality between the best procedures (probabilistic linkage and multiple matchkeys) are minor.
Conclusions: Microsimulation is a valuable tool for designing record linkage procedures. By modelling the processes that result in changes or errors in quasi-identifiers, it seems possible to predict the data quality to be expected after a register is implemented.
Introduction: The Patient Master Index (PMI) plays an important role in the management of patient information and in epidemiological research, and the availability of unique patient identifiers improves accuracy when linking patient records across disparate datasets. In our environment, however, a unique identifier is seldom present in all datasets containing patient information. Quasi-identifiers are used to attempt to link patient records, but this sometimes presents a higher risk of over-linking. Data quality and completeness thus affect the ability to make correct linkages.
Aim: This paper describes the record linkage system that is currently implemented at the Provincial Health Data Centre (PHDC) in the Western Cape, South Africa, and assesses its output to date.
Methods: We apply a stepwise deterministic record linkage approach to link patient data that are routinely collected from health information systems in the Western Cape province of South Africa. Variables used in the linkage process include South African National Identity number (RSA ID), date of birth, year of birth, month of birth, day of birth, residential address and contact information. Descriptive analyses are used to estimate the level and extent of duplication in the provincial PMI.
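The PHDC's actual pass rules are not given in this abstract; as a hedged sketch only, the snippet below illustrates the general shape of a stepwise deterministic approach, in which successively weaker combinations of the identifiers listed above are applied only to records left unmatched by the stricter earlier passes. The specific passes and column names are illustrative assumptions, not the PHDC's implementation.

```python
# Hypothetical sketch: stepwise deterministic linkage with ordered passes.
# Each pass links only records still unmatched after the stricter passes above it.
import pandas as pd

PASSES = [
    ["rsa_id"],                                        # pass 1: national ID exact match
    ["surname", "first_name", "date_of_birth"],        # pass 2: full name + DOB
    ["surname", "year_of_birth", "month_of_birth",
     "residential_address"],                           # pass 3: partial DOB + address
]

def stepwise_link(incoming: pd.DataFrame, pmi: pd.DataFrame) -> pd.DataFrame:
    """Attach a pmi_id to incoming records using ordered deterministic passes."""
    unmatched = incoming.copy()
    linked = []
    for keys in PASSES:
        candidates = unmatched.dropna(subset=keys).merge(
            pmi.dropna(subset=keys)[keys + ["pmi_id"]], on=keys, how="inner"
        )
        # Discard ambiguous links (one incoming record matching several PMI entries).
        candidates = candidates.drop_duplicates(subset="record_id", keep=False)
        linked.append(candidates)
        unmatched = unmatched[~unmatched["record_id"].isin(candidates["record_id"])]
    return pd.concat(linked + [unmatched], ignore_index=True)
```

Ordering the passes from strictest to most permissive is one common way to limit over-linking when later passes rely on error-prone fields such as names or addresses.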
Results: The percentage of duplicates in the provincial PMI lies between 10% and 20%. Duplicates mainly arise from spelling errors; surname and first names carry most of the errors, with the first names and surname differing for the same individual in approximately 22% of duplicates. The RSA ID is the variable most affected by poor completeness, with less than 30% of records having an RSA ID. The current linkage algorithm requires refinement, as it makes use of algorithms that have been developed and validated on anglicised names, which might not work well for local names. Linkage is also affected by data quality issues associated with the routine nature of the data, which often make it difficult to validate and enforce integrity at the point of data capture.
Data collection, analysis, and data-driven action cycles have been viewed as vital components of healthcare for decades. Throughout the COVID-19 pandemic, case incidence and mortality data have consistently been used by various levels of government and health institutions to inform pandemic strategies and service distribution. However, these responses are often inequitable, underscoring pre-existing healthcare disparities faced by marginalized populations. This has prompted governments to finally face these disparities and find ways to quickly deliver more equitable pandemic support. These rapid, data-informed supports demonstrated that learning health systems (LHS) could be quickly mobilized and used effectively to deliver healthcare interventions that matched diverse populations' needs in equitable and affordable ways. Within LHS, data are viewed as a starting point researchers can use to inform practice and subsequent research. Despite this innovative approach, the quality and depth of data collection and robust analyses vary throughout healthcare, with data lacking across the quadruple aims. Large data gaps often exist pertaining to community socio-demographics, patient perceptions of healthcare quality, and the social determinants of health. This prevents a robust understanding of the healthcare landscape, leaving marginalized populations uncounted and on the sidelines of improvement efforts. These gaps are often viewed by researchers as an indication that more data are needed rather than as an opportunity to critically analyze and iteratively learn from multiple sources of pre-existing data. This continued cycle of data collection and analysis leaves one to wonder whether healthcare has a data problem or a learning problem. In this commentary, we discuss ways healthcare data are often used and how LHS disrupt this cycle, turning data into learning opportunities that inform healthcare practice and future research in real time. We conclude by proposing several ways to make learning from data just as important as the data itself.