Pub Date: 2023-10-02 | DOI: 10.23889/ijpds.v8i1.2144
Alexander Hartenstein, Khaled Abdelgawwad, Frank Kleinjung, Stephen Privitera, Thomas Viethen, Tatsiana Vaitsiakhovich
Introduction
In randomised controlled trials (RCTs), bleeding outcomes are often assessed using definitions provided by the International Society on Thrombosis and Haemostasis (ISTH). Information relating to bleeding events in real-world evidence (RWE) sources is not identified using these definitions. To enable accurate comparisons between clinical trials and real-world studies, algorithms are required to identify ISTH-defined bleeding events in RWE sources.

Objectives
To present a novel algorithm to identify ISTH-defined major and clinically relevant non-major (CRNM) bleeding events in a US Electronic Health Record (EHR) database.

Methods
The ISTH definition for major bleeding was divided into three subclauses: fatal bleeds, critical organ bleeds, and symptomatic bleeds associated with haemoglobin reductions. Data elements from EHRs required to identify patients fulfilling these subclauses (algorithm components) were defined according to International Classification of Diseases, 9th and 10th Revisions, Clinical Modification disease codes that describe key bleeding events. Other data providing context on bleeding severity included in the algorithm were: 'interaction type' (diagnosis in the inpatient or outpatient setting), 'position' (primary/discharge or secondary diagnosis), haemoglobin values from laboratory tests, blood transfusion codes, and mortality data.

Results
In the final algorithm, the components were combined to align with the subclauses of the ISTH definitions for major and CRNM bleeds. A matrix was proposed to guide identification of ISTH bleeding events in the EHR database. The matrix categorises bleeding events by combining data from the algorithm components, including diagnosis codes, 'interaction type', 'position', decreases in haemoglobin concentration (≥2 g/dL over 48 hours), and mortality.
Conclusions
The novel algorithm proposed here identifies ISTH major and CRNM bleeding events, as commonly investigated in RCTs, in a real-world EHR data source. This algorithm could facilitate comparison between the frequency of bleeding outcomes recorded in clinical trials and RWE. Validation of algorithm performance is in progress.
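The decision matrix described in the Results can be pictured as a small classifier over coded events. The sketch below is purely illustrative and is not the authors' algorithm: the field names (`fatal`, `critical_organ`, `hb_drop_g_dl`, `transfused`, `inpatient`, `primary_dx`) and the CRNM rule are assumptions; only the three major-bleed subclauses and the ≥2 g/dL haemoglobin-drop criterion come from the abstract.

```python
# Illustrative sketch of an ISTH-style bleed classifier.
# Field names and the CRNM rule are hypothetical; only the
# >= 2 g/dL haemoglobin-drop criterion and the three major-bleed
# subclauses are taken from the abstract.

def classify_bleed(event: dict) -> str:
    """Classify a coded bleeding event as 'major', 'CRNM', or 'other'."""
    if event["fatal"]:                               # subclause 1: fatal bleed
        return "major"
    if event["critical_organ"]:                      # subclause 2: critical organ bleed
        return "major"
    if event["hb_drop_g_dl"] >= 2 or event["transfused"]:
        return "major"                               # subclause 3: symptomatic bleed with Hb fall
    if event["inpatient"] or event["primary_dx"]:    # assumed proxy for clinical relevance
        return "CRNM"
    return "other"
```

For example, a non-fatal inpatient bleed with a 1 g/dL haemoglobin drop would fall through the major subclauses and be labelled CRNM under these assumed rules.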
Title: Identification of International Society on Thrombosis and Haemostasis major and clinically relevant non-major bleed events from electronic health records: a novel algorithm to enhance data utilisation from real-world sources (International Journal for Population Data Science)
Pub Date: 2023-09-28 | DOI: 10.23889/ijpds.v8i5.2177
Jason Edward Black, Amanda Terry, Sonny Cejic, Thomas Freeman, Daniel Lizotte, Scott McKay, Mark Speechley, Bridget Ryan
Introduction
We set out to assess the impact of the Choosing Wisely Canada recommendations (2014) on reducing unnecessary health investigations and interventions in primary care across Southwestern Ontario.

Methods
We used the Deliver Primary Healthcare Information (DELPHI) database, which stores deidentified electronic medical records (EMRs) of nearly 65,000 primary care patients across Southwestern Ontario. When conducting research using EMR data, data provenance (i.e., how the data came to be) should first be established. We first considered DELPHI data provenance in relation to longitudinal analyses, flagging a change in EMR software that occurred during 2012 and 2013. We attempted to link records between the EMR databases produced by the different software packages using probabilistic linkage, and inspected 10 years of data in the DELPHI database (2009 to 2019) for data quality issues, including comparability over time.

Results
We encountered several issues resulting from this change in EMR software. These included limited linkage of records between software without a common identifier; data migration issues that distorted procedure dates; and unusual changes in laboratory test and medication prescription volumes.

Conclusion
This study reinforces the necessity of assessing data provenance and quality for new research projects. By understanding data provenance, we can anticipate related data quality issues, such as changes in EMR data over time, which represent a growing concern as longitudinal data analyses increase in feasibility and popularity.
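Probabilistic linkage of the kind mentioned in the Methods is commonly implemented in the Fellegi-Sunter style: for each compared field, a log-likelihood agreement weight is added when values match and a disagreement weight when they differ, and record pairs above a threshold are treated as links. The sketch below is a generic illustration under assumed m/u probabilities and field names, not the DELPHI team's actual configuration.

```python
import math

# Minimal Fellegi-Sunter-style match scorer. The m/u probabilities
# and field names below are illustrative assumptions, not the
# DELPHI linkage configuration.

WEIGHTS = {
    # field: (m = P(agree | true match), u = P(agree | non-match))
    "birth_year": (0.95, 0.05),
    "sex":        (0.98, 0.50),
    "postcode":   (0.90, 0.01),
}

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Sum log2 agreement/disagreement weights over the compared fields."""
    score = 0.0
    for field, (m, u) in WEIGHTS.items():
        if rec_a.get(field) == rec_b.get(field):
            score += math.log2(m / u)            # agreement weight
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement weight
    return score
```

Without a common identifier, scores for true matches and non-matches overlap, which is one way the limited linkage reported in the Results can arise.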
Title: Understanding data provenance when using electronic medical records for research: Lessons learned from the Deliver Primary Healthcare Information (DELPHI) database (International Journal for Population Data Science)
Pub Date: 2023-09-25 | DOI: 10.23889/ijpds.v8i1.2151
Julie-Anne Smit, Rieke Van der Graaf, Menno Mostert, Ilonca Vaartjes, Mira Zuidgeest, Diederik Grobbee, Johannes J.M. van Delden
Introduction
Data linkage for health research purposes enables the answering of countless new research questions, and is said to be cost-effective and less intrusive than other means of data collection. Nevertheless, health researchers currently face a complicated, fragmented, and inconsistent regulatory landscape with regard to the processing of data, and progress in health research is hindered.

Aim
We designed a qualitative study to assess what different stakeholders perceive as ethical and legal obstacles to data linkage for health research purposes, and how these obstacles could be overcome.

Methods
Two focus groups and eighteen semi-structured in-depth interviews were held to collect the opinions and insights of various stakeholders. An inductive thematic analysis approach was used to identify overarching themes.

Results
This study showed that ambiguity regarding the 'correct' interpretation of the law, the fragmentation of policies governing the processing of personal health data, and the demandingness of legal requirements are experienced by the participating stakeholders as impediments to data linkage for research purposes. To remove or reduce these obstacles, authoritative interpretations of the laws and regulations governing data linkage should be issued. The participants furthermore encouraged the harmonisation of data linkage policies, as well as promoting trust and transparency and enhancing technical and organisational measures. Lastly, there is a demand among the participants for legislative and regulatory modifications.

Conclusions
To overcome the obstacles in data linkage for scientific research purposes, perhaps we should shift the focus from adapting the current laws and regulations governing data linkage, or even designing completely new laws, towards creating a more thorough understanding of the law and making better use of the flexibilities within existing legislation.
Important steps in achieving this shift could be clarification of the legal provisions governing data linkage by issuing authoritative interpretations, as well as the strengthening of ethical-legal oversight bodies.
Title: Overcoming ethical and legal obstacles to data linkage in health research: stakeholder perspectives (International Journal for Population Data Science)
Pub Date: 2023-09-20 | DOI: 10.23889/ijpds.v8i4.2142
P Alison Paprica, Monique Crichlow, Donna Curtis Maillet, Sarah Kesselring, Conrad Pow, Thomas P. Scarnecchia, Michael J. Schull, Rosario G. Cartagena, Annabelle Cumyn, Salman Dostmohammad, Keith O. Elliston, Michelle Greiver, Amy Hawn Nelson, Sean L. Hill, Wanrudee Isaranuwatchai, Evgueni Loukipoudis, James Ted McDonald, John R. McLaughlin, Alan Rabinowitz, Fahad Razak, Stefaan G. Verhulst, Amol A. Verma, J. Charles Victor, Andrew Young, Joanna Yu, Kimberlyn McGrail
Introduction
Around the world, many organisations are working on ways to increase the use, sharing, and reuse of person-level data for research, evaluation, planning, and innovation while ensuring that data are secure and privacy is protected. As a contribution to broader efforts to improve data governance and management, in 2020 members of our team published 12 minimum specification essential requirements (min specs) to provide practical guidance for organisations establishing or operating data trusts and other forms of data infrastructure.

Approach and Aims
We convened an international team, consisting mostly of participants from Canada and the United States of America, to test and refine the original 12 min specs. Twenty-three data-focused organisations and initiatives recorded the various ways they address the min specs. Sub-teams analysed the results, used the findings to improve the min specs, and identified materials to support organisations and initiatives in addressing them.

Results
Analyses and discussion led to an updated set of 15 min specs covering five categories: one min spec for Legal, five for Governance, four for Management, two for Data Users, and three for Stakeholder & Public Engagement. Multiple changes were made to make the min specs language more technically complete and precise. The updated set of 15 min specs has been integrated into a Canadian national standard that, to our knowledge, is the first to include requirements for public engagement and Indigenous Data Sovereignty.

Conclusions
The testing and refinement of the min specs led to significant additions and improvements. The min specs helped the 23 organisations and initiatives involved in this project communicate and compare how they achieve responsible and trustworthy data governance and management. By extension, the min specs, and the Canadian national standard based on them, are likely to be useful for other data-focused organisations and initiatives.
Title: Essential requirements for the governance and management of data trusts, data repositories, and other data collaborations (International Journal for Population Data Science)
Pub Date: 2023-09-18 | DOI: 10.23889/ijpds.v8i3.2276
Giovanni Schiazza
Introduction & Background
Internet Memes (IMs) are social, digital artefacts that act as information vectors on social networking sites. Memetic scholarly literature has mainly focused on analysing IM content with mixed methods. However, little scholarly attention has been devoted to exploring the relationships between IMs and users through survey methodologies.
Users engage with IMs in many ways, but the scholarly literature lacks studies on the Individual Differences (IDs) that might make users more or less prone to engage with them. Our results suggest that certain psychological factors may affect IM engagement.
Objectives & Approach
This study examines how individual determinants relate to engagement with general and political internet memes. An exploratory survey design is employed on an online sample of 472 participants.
To measure meme engagement, we develop a novel scale by asking participants how likely (1-7) they are to exhibit certain behaviours (liking, commenting, sharing on their account, tagging, or sending privately to someone) on general memes and politically centred ones. The novel scale's feasibility is tested, achieving good internal reliability (α > 0.7) and a good fit in confirmatory factor analysis.
The survey included validated measures of Fear of Missing Out (FOMO), Conspiracy Belief (CB), personality traits (Big-5), Bullshit Receptivity (BR), the Critical Reflection Test (CRT) and an adaptation of Social Network Intensity (SNI). All the measures employed achieved good internal consistency (α > 0.7). The study thematically groups the measures related to IMs (engagement, familiarity and attitude), social media (FOMO, SNI), cognitive style (CRT, BR, CB), personality (Big-5) and socio-demographics (age, gender, education, ethnicity, nationality, ideology).
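The internal-consistency criterion (α > 0.7) refers to Cronbach's alpha, computed from the item variances and the variance of respondents' total scores. A minimal sketch, using invented toy responses rather than the study's data:

```python
from statistics import variance

# Cronbach's alpha for a k-item scale.
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
# The toy ratings in the test are invented for illustration only.

def cronbach_alpha(items: list[list[float]]) -> float:
    """items: one list of respondent scores per scale item."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]        # per-respondent totals
    item_var = sum(variance(scores) for scores in items)    # sum of item variances
    return k / (k - 1) * (1 - item_var / variance(totals))
```

Two perfectly correlated items yield α = 1.0, while unrelated items pull α below the 0.7 threshold the abstract uses as its cut-off.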
Relevance to Digital Footprints
With increasing interest in, and research on, computational analysis of social media and its phenomena, there is a need for survey research to explore connections between IDs and user behaviour using a mix of validated and novel ad-hoc measures.
User interaction with internet memes creates a data trail that can be used to infer several individual determinants through machine learning techniques. However, further psychological research is needed to assess and underpin the linkages between IDs and IM engagement before inferring IDs on large datasets.
Results
Bivariate correlations reveal that younger age, extraversion, neuroticism, SNI, FOMO, BR and CB are positively associated with internet meme engagement regardless of content. Further, t-tests of dependent correlations show that age, FOMO and ideology differ significantly in their correlations between general and political meme engagement.
Engagement with political IMs is slightly higher in people with left-leaning ideology and lower levels of conscientiousness. A positive attitude towards IMs correlates with a marginally higher openness to experience a
Title: Measuring Internet Meme engagement and individual differences: a novel scale and its correlates (International Journal for Population Data Science)
Pub Date: 2023-09-18 | DOI: 10.23889/ijpds.v8i3.2290
James Goulding, Elizabeth Dolan, Gavin Long, Anya Skatova, John Harvey, Gavin Smith, Laila Tata
Introduction & Background
The COVID-19 pandemic led to unparalleled pressure on healthcare services, highlighting the need for improved healthcare planning for respiratory disease outbreaks. With rapid virus diversification, and correspondingly rapid shifts in symptom expression, there is often a complete lack of representative clinical testing data available to modellers. This is especially true at the onset of outbreaks, where traditional epidemiological and statistical approaches that rely on case-data 'ground truths' are extremely challenging to apply. In this abstract we preview the results of two novel studies that investigate how digital footprint data, in the form of over-the-counter medication sales, might serve as a predictive proxy for underlying and often hidden disease incidence, and the extent to which such data might improve mortality rate forecasting at local area levels.
Objectives & Approach
Over 2 billion transactions logged by a UK high-street health retailer were collated across English local authorities (n = 314), generating weekly variables corresponding to a range of health purchase behaviours (e.g. cough mixture and pain-relief sales) in each authority. These purchase data were additionally linked to a set of independent variables describing each local authority's (1) weekly environment (e.g. weather, temperature, pollution), (2) socio-demographics (e.g. age distributions, deprivation levels, population densities) and (3) available local test case data. Machine learning regression models were then deployed to investigate the ability of each of these variable sets to underpin predictions of weekly registered deaths in the 314 authorities that were due to COVID-19 between April 2020 and December 2021 (Study 1) or general respiratory disease between March 2016 and March 2020 (Study 2). All models were rigorously tested out-of-sample via walk-forward cross-validation, and across a range of forecast windows.
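Walk-forward cross-validation keeps every training window strictly before its forecast window, so each model is scored only on weeks it has not seen. A minimal sketch of the split generation, with illustrative window sizes (the studies' actual configuration is not stated in the abstract):

```python
# Sketch of walk-forward cross-validation index generation for
# weekly time-series data. Window sizes are illustrative assumptions.

def walk_forward_splits(n_weeks: int, initial_train: int, horizon: int):
    """Return (train_indices, test_indices) pairs; training always precedes testing."""
    splits = []
    end = initial_train
    while end + horizon <= n_weeks:
        train = list(range(end))                 # all weeks up to the cutoff
        test = list(range(end, end + horizon))   # the next `horizon` weeks
        splits.append((train, test))
        end += horizon                           # roll the cutoff forward
    return splits
```

Each split's model would then be fitted (e.g. an XGBoost regressor, as in Study 1) on the training weeks and evaluated on the held-out forecast window.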
Relevance to Digital Footprints
Epidemics such as COVID-19 are recognised as being driven as much by behavioural factors as by clinical ones. Indicators of infection rates may be revealed in purchasing and self-medication logs, where rich data exist: in 2022, UK citizens were reported to generate over 1 billion prescriptions, consume around 6,300 tonnes of paracetamol, and spend £572m on cough, cold and sore throat treatments. The digital footprint data logs generated by such activities may hold the potential to reveal hidden disease incidence and risk to vulnerable communities, without reliance on prohibitively expensive testing infrastructures.
Results
Evidence was found that models incorporating digital footprint sales data significantly out-performed models that used only variables traditionally associated with respiratory disease (e.g. sociodemographics, weather, or case data). In Study 1, XGBoost models were able to optimally predict the number of COVID deaths 21 days in
{"title":"Forecasting local COVID-19/Respiratory Disease mortality via national longitudinal shopping data: the case for integrating digital footprint data into early warning systems","authors":"James Goulding, Elizabeth Dolan, Gavin Long, Anya Skatova, John Harvey, Gavin Smith, Laila Tata","doi":"10.23889/ijpds.v8i3.2290","DOIUrl":"https://doi.org/10.23889/ijpds.v8i3.2290","url":null,"abstract":"Introduction & BackgroundThe COVID-19 pandemic led to unparalleled pressure on healthcare services, highlighting the need for improved healthcare planning for respiratory disease outbreaks. With rapid virus diversification, and correspondingly rapid shifts in symptom expression, there is often a complete lack of representative clinical testing data available to modellers. This is especially true at the onset in outbreaks, where traditional epidemiological and statistical approaches that utilise case data ‘ground truths’ are extremely challenging to apply. In this abstract we preview the results of two novel studies that investigate how the use of digital footprint data - in the form of over-the-counter medication sales - might serve as a predictive proxy for underlying and often hidden disease incidence, and the extent to which such data might improve mortality rate forecasting at local area levels.
 Objectives & ApproachOver 2 billion transactions logged by a UK high-street health retailer were collated across English local authorities (n=314), generating weekly variables corresponding to a range of health purchase behaviours (e.g cough mixture / pain-relief sales) in each authority. These purchase data were additionally linked to a set of independent variables describing each local authority’s 1. weekly environment (e.g. weather, temperature, pollution), 2. socio-demographics (e.g. age distributions, deprivation levels, population densities) and 3. available local test case data. Machine learning regression models were then deployed to investigate the ability of each of these variable sets to underpin predictions of weekly registered deaths in the 314 authorities that were due to: COVID-19 between Apr 2020 - Dec 2021 (Study 1) or general respiratory disease between March 2016 - Mar 2020 (Study 2). All models were rigorously tested out-of-sample via walk forward cross-validation, and across a range of forecast windows.
 Relevance to Digital FootprintsEpidemics such as COVID-19 are recognised as being driven as much by behavioural factors as they are by clinical ones. Indicators of infection rates may be revealed in purchasing and self-medication logs, where there exists rich data: in 2022 UK citizens were reported to generate >1 billion prescriptions; consume ~6,300 tonnes of paracetamol; and spend £572m on cough, cold and sore throat treatments. Application of the digital footprint data logs generated by such activities may hold potential to reveal hidden disease incidence and risk to vulnerable communities, without reliance on prohibitively expensive testing infrastructures.
 ResultsEvidence was found that models incorporating digital footprint sales data were able to significantly out-perform models that used variables traditionally associated with respiratory disease alone (e.g. sociodemographics, weather, or case data). In Study 1, XGBoost models were able to optimally predict the number of COVID deaths 21 days in ","PeriodicalId":132937,"journal":{"name":"International Journal for Population Data Science","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135153276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
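The walk-forward, out-of-sample testing described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the function names are invented, the univariate series stands in for the study's per-authority feature sets, and a persistence baseline stands in for the XGBoost models actually used.

```python
# Illustrative sketch of walk-forward cross-validation for weekly
# mortality forecasting. All names are hypothetical; the studies used
# gradient-boosted (XGBoost) models over richer multivariate features.

def walk_forward_splits(n_weeks, initial_train, horizon):
    """Yield (train_weeks, test_weeks) index pairs that only move forward
    in time, so a model is never fitted on future observations."""
    end = initial_train
    while end + horizon <= n_weeks:
        yield list(range(end)), list(range(end, end + horizon))
        end += horizon

def naive_forecast(train_series, horizon):
    """Persistence baseline: repeat the last observed weekly value."""
    return [train_series[-1]] * horizon

def evaluate(series, initial_train=52, horizon=3):
    """Mean absolute error of the baseline across all walk-forward folds."""
    errors = []
    for train_idx, test_idx in walk_forward_splits(len(series), initial_train, horizon):
        preds = naive_forecast([series[i] for i in train_idx], horizon)
        errors += [abs(p - series[i]) for p, i in zip(preds, test_idx)]
    return sum(errors) / len(errors)
```

In the studies themselves, each fold would refit a regressor on the linked sales, weather and sociodemographic variables, with `horizon` swept across the range of forecast windows (e.g. 21 days).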
Automatic Lifestate Identification and Clustering
Pub Date : 2023-09-18 DOI: 10.23889/ijpds.v8i3.2274
Sam Smith, Gavin Smith, John Harvey
Introduction & Background
Summarising high-dimensional time series data across multiple entities is an increasingly prevalent problem, because mass data collection has become routine in most domains. We propose a method of automatically summarising such high-dimensional data.
Objectives & Approach
Summarisation in this context concerns both reducing the high-dimensional observations and reducing the large number of temporal points. While numerous methods exist to segment and/or summarise time series, their properties often do not align with the needs of consumers of the summaries, or they require unrealistic parameter settings. Addressing this, we define a set of broad properties that lead to high utility across a broad class of domains, determined by an information-theoretic notion of optimality. Intuitively, these properties reflect the summarisation of such data into lifestates where (1) the number of possible lifestates is limited and shared across entities, to allow interpretation and comparison, and (2) the number of lifestate transitions is jointly controlled, to provide a parameterless, optimal summarisation of both the high sample and temporal dimensionality.
Relevance to Digital Footprints
Example data include regular survey collection, consumer purchasing history from transactional data (where the number of possible items to choose from is high), and other repeatedly sampled digital data. Within the Digital Footprints domain, concise descriptions (summarisations) of high-dimensional data are extremely important. For example, lifestates within health records could be identified and used to find critical patterns in the decline or recovery of patients.
Conclusions & Implications
This work aims to find segmentations that optimally trade off the number of states and segments that humans must then interpret, while still capturing salient state changes. Building on prior work, we propose a model whose complexity is controlled by normalised maximum likelihood (NML). In short, the proposed model generates automated summarisations that are, in an information-theoretic sense, both optimally concise and informationally rich.
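The core trade-off, fewer segments versus better fit, can be illustrated with a minimal penalised change-point sketch. Note this uses a hand-set per-segment cost rather than the authors' parameterless NML criterion, and all names are hypothetical.

```python
# Minimal sketch of penalised time-series segmentation: dynamic
# programming chooses the segmentation minimising within-segment squared
# error plus a fixed cost per segment. The NML criterion in the abstract
# replaces this hand-set penalty with a parameterless model cost.

def _sse(prefix, prefix_sq, i, j):
    """Sum of squared deviations from the mean for xs[i:j]."""
    n = j - i
    s = prefix[j] - prefix[i]
    sq = prefix_sq[j] - prefix_sq[i]
    return sq - s * s / n

def segment(xs, penalty):
    """Return optimal segment boundaries [0, b1, ..., len(xs)]."""
    n = len(xs)
    prefix, prefix_sq = [0.0], [0.0]
    for x in xs:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)
    best = [0.0] + [float("inf")] * n   # best[j]: min cost of xs[:j]
    back = [0] * (n + 1)                # back[j]: start of last segment
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + _sse(prefix, prefix_sq, i, j) + penalty
            if cost < best[j]:
                best[j], back[j] = cost, i
    cuts = [n]
    while cuts[-1] > 0:
        cuts.append(back[cuts[-1]])
    return cuts[::-1]
```

A larger penalty yields fewer, coarser "lifestates"; the contribution of the abstract is to remove that free parameter entirely via NML, and to share the state vocabulary across entities.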
Identifying and understanding dietary transitions and nutrient deficiency from loyalty card digital footprints
Pub Date : 2023-09-18 DOI: 10.23889/ijpds.v8i3.2266
Roberto Mansilla, Gavin Long, Simon Welham, John Harvey, Evgeniya Lukinova, Georgiana Nica-Avram, Gavin Smith, Andrew Smith, James Goulding
Introduction & Background
The shift towards plant-based diets continues to rise. Several observational studies have suggested that adopting these diets can result in fundamental nutrient deficiencies, such as iodine deficiency. This can be especially harmful to a developing fetus, leading to growth impairment and, in extreme cases, cretinism. Nonetheless, understanding the long-term health consequences of this shift remains a challenge, particularly regarding nutritional impact at broader population scales.
Objectives & Approach
Our study focuses on the effects of transitioning to plant-based diets on the purchasing, and assumed intake, of essential nutrients such as iodine, calcium and vitamin B12. We analysed anonymised shopping records of 10,626 loyal customers who switched from regular milk to alternative milk products. By matching the transaction data with nutritional information, we estimated weekly nutrient purchases before and after the transition. Our data were collected from a national food retailer across the UK.
Relevance to Digital Footprints
Loyalty-card transactional logs held by retailers offer a valuable lens onto nutritional intake. These data can provide insight into the potential impact of purchasing behaviours, such as the health effects of dietary changes at scale. Our approach leverages AI modelling, accompanied by rigorous variable-importance methods, to uncover potentially hidden insights into the impact of nutritional shifts towards plant-based goods.
Results
Results indicate that 83% of individuals deemed regular customers who switched to plant-based milk decreased their purchases of iodine (by 44%), calcium (by 30%) and vitamin B12 (by 39%) relative to their normal purchase patterns at the retailer. Additionally, 57% of these individuals decreased their iodine purchases by more than 50%. The reduction in these nutrients is even greater for those who switch to plant-based dairy and meat products.
Conclusions & Implications
Our research indicates that dietary changes, such as switching from purchasing regular milk to alternative milks, may lead to insufficient intake of essential dietary nutrients such as iodine. This represents a significant potential public health concern if not remediated, especially in countries that do not require salt to be fortified with iodine.
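The before/after nutrient-purchase estimate can be illustrated with a toy linkage of transaction records to a nutrient lookup table. Everything here, product names, iodine values and the switch-date split, is invented for illustration and is not the study's data or pipeline.

```python
# Toy sketch: link purchase records to per-product nutrient content and
# compare mean weekly iodine purchased before vs. after a customer's
# switch to plant-based milk. All products and values are illustrative.
from collections import defaultdict
from datetime import date

IODINE_UG = {"cow_milk_1l": 320.0, "oat_drink_1l": 0.0}  # hypothetical values

def weekly_iodine(purchases, switch_date):
    """purchases: list of (date, product, quantity) tuples.
    Returns (mean weekly iodine before the switch, mean after), in ug."""
    buckets = {"before": defaultdict(float), "after": defaultdict(float)}
    for day, product, qty in purchases:
        period = "before" if day < switch_date else "after"
        week = day.isocalendar()[:2]          # (ISO year, ISO week) key
        buckets[period][week] += IODINE_UG.get(product, 0.0) * qty
    return tuple(
        sum(b.values()) / len(b) if b else 0.0
        for b in (buckets["before"], buckets["after"])
    )
```

The study-scale version of this join, across 10,626 customers and a full nutrient table, is what allows population-level purchase deficits (e.g. the 44% iodine drop) to be quantified.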
Data donation of individual shopping data to help predict the occurrence of disease: A pilot study linking individual loyalty card and health survey data to investigate COVID-19
Pub Date : 2023-09-18 DOI: 10.23889/ijpds.v8i3.2273
Elizabeth Dolan, James Goulding, Anya Skatova
Introduction & Background
Previous studies have found that shopping data could increase the predictive accuracy of disease surveillance systems and illuminate behavioural responses in the self-management of disease symptoms. Yet accessing individual sales datasets for linkage to health datasets is challenging, and the recruitment of appropriate sample sizes for medical research has been limited.
Objectives & Approach
Objectives: to collect and link individual health data to individual shopping data to investigate COVID-19; to assess the feasibility of scaling up this method; and to use the collected data to investigate the use of loyalty card data in machine learning (ML) models for disease.
Methods
Based on recommendations on the public's preferences for data donation, a new protocol was designed for collecting, linking and analysing shopping and health data. Participants were requested to use the Tesco Clubcard website's data portability function to share their loyalty card data, and to complete an online health survey. An exploratory data analysis was conducted on the linked dataset. Participants were recruited online (18/01/2022 to 04/02/2022) with a recruitment target of 200.
Relevance to Digital Footprints
The collection and analysis of individual transactional sales data for health research.
Results
197 participants shared their Tesco Clubcard and health survey data. The Clubcard data contained 893,414 transactions covering 65,310 uniquely named items purchased from 2015 to 2022. The average number of transactions per participant was 4,653 (SD 5,256), and the average timeframe recorded was 5 years, 6 months and 30 days (SD 836 days). A total of 6,993 medication sales were recorded, accounting for 1% of sales; 81% (159/197) of participants bought medications, with an average of 44 (SD 68) medication purchases per individual. Most participants (196/197) shared their health status in the survey, and 94% (81/86) of those on medication shared the medication names. Participants reported donating their data to do good (79%, 155/197), to help the NHS (77%, 152/197), to be socially responsible (74%, 144/197) and because the data were secure and anonymised (78%, 153/197).
Conclusions & Implications
Using this new protocol, which enables convenient data sharing with transparent data safeguards, the public were willing to share both their shopping and health data for research into COVID-19. To apply robust ML analysis, particularly to explore self-medication at an individual level, recruitment must be significantly scaled up to collect data from enough individuals with high sales volumes and regular shopping frequency, or new ML techniques must be developed to address the sparseness of key health-related purchasing events in loyalty card data. The study suggests public readiness to share shopping data for health research, but investment is needed for large-scale data collection and AI application.
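The exploratory summaries reported above (mean and SD of per-participant transaction counts, share of participants buying medication) can be reproduced with a small standard-library sketch. The data structure and figures below are synthetic, invented purely to show the shape of the analysis.

```python
# Sketch of the exploratory summary over donated loyalty card data:
# per-participant transaction statistics and the share of participants
# with any medication purchase. Input data here are synthetic; the study
# reported e.g. a mean of 4,653 transactions per participant (SD 5,256).
from statistics import mean, stdev

def summarise(participants):
    """participants: {participant_id: {"transactions": int, "medications": int}}.
    Returns (mean transaction count, SD, % of participants buying medication)."""
    counts = [p["transactions"] for p in participants.values()]
    med_share = 100.0 * sum(
        1 for p in participants.values() if p["medications"] > 0
    ) / len(participants)
    return mean(counts), stdev(counts), med_share
```

Scaling this protocol up, as the conclusions argue, would turn such descriptive summaries into features dense enough for individual-level ML models of self-medication.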