A generalisable linkage pipeline (GLADIS) to facilitate research for the public good
Pratibha Vellanki, Mary Cleaton
International Journal for Population Data Science, published 2023-09-14
https://doi.org/10.23889/ijpds.v8i2.2219
Objectives
The Integrated Data Service (IDS) is a new cross-government service that facilitates research for the public good. Key to its success are Integrated Data Assets (IDAs): de-identified, grouped datasets that are joinable on an artificial ID and themed on a given topic. The Demographic Index (DI) comprises five linked administrative datasets. We are developing a generalisable method that will link administrative and survey datasets to the DI via a customisable, reproducible pipeline, to produce IDAs.
Methods
The method relies on the established approaches of deterministic and probabilistic data linkage, and uses the Splink implementation of the Fellegi-Sunter method for probabilistic matching. The pipeline will include a tool for quality assurance (QA) via clerical review.
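To make the probabilistic step concrete, the following is a minimal sketch of Fellegi-Sunter scoring in plain Python. It does not use Splink's API and is not the pipeline's implementation. The field names, the m and u values, and the prior are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical parameters for three linkage variables (illustrative values only).
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
PARAMS = {
    "surname": {"m": 0.95, "u": 0.01},
    "date_of_birth": {"m": 0.98, "u": 0.003},
    "postcode": {"m": 0.85, "u": 0.0005},
}


def match_weight(record_a, record_b, params=PARAMS):
    """Sum the log2 Bayes factors over all comparison fields."""
    total = 0.0
    for field, p in params.items():
        a, b = record_a.get(field), record_b.get(field)
        if a is None or b is None:
            # Missing values carry no evidence either way in this sketch.
            continue
        if a == b:
            total += math.log2(p["m"] / p["u"])
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))
    return total


def match_probability(weight, prior=1e-4):
    """Convert a match weight into a posterior match probability.

    `prior` is the assumed probability that a random record pair is a match.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)
```

Agreement on a field adds evidence for a match and disagreement subtracts it, so a pair agreeing on every field receives a large positive weight. A deterministic rule, by contrast, would simply require exact agreement on a fixed set of fields.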
We are researching a generalisable implementation of Splink, deriving the method’s control parameters from the results of the deterministic matching. We are also researching the use of Locality Sensitive Hashing (LSH), a dimensionality-reduction method that has been suggested to improve computational efficiency, for blocking. Efficient blocking is especially important given the large size of the datasets involved.
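One way LSH can serve as a blocking step is sketched below. The abstract does not say which LSH family is being researched, so MinHash over character bigrams is an assumption here, as are the function names and the `num_hashes` and `bands` settings. Records whose signatures collide in at least one band become candidate pairs, and only those pairs would go on to be scored.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=2):
    """Return the set of character k-grams of a normalised string."""
    text = text.lower().strip()
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, varied by seed."""
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash value under each seeded hash."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def lsh_candidate_pairs(records, num_hashes=32, bands=16):
    """Return pairs of record IDs that share at least one LSH band.

    `records` maps a record ID to the string used for blocking,
    for example a concatenation of name fields (hypothetical choice).
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

More bands with fewer rows each make the blocking more permissive (higher recall, more candidate pairs), while fewer bands with more rows make it stricter. Unlike exact-key blocking, this tolerates small spelling differences in the blocking string.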
Results
We plan to produce linked datasets at three quality levels: prioritising precision, balancing precision and recall, and prioritising recall. Because the datasets are always linked to the DI, the DI’s artificial ID can be used as a ‘spine’ to bring them together as assets (IDAs).
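The sketch below illustrates one way the three quality levels and the spine could work together. It assumes the levels correspond to different cutoffs on a match probability, which the abstract does not state. The threshold values, level names, and record IDs are all hypothetical.

```python
# Hypothetical match-probability cutoffs, one per quality level.
THRESHOLDS = {
    "high_precision": 0.99,
    "balanced": 0.90,
    "high_recall": 0.50,
}


def links_by_quality(scored_pairs, thresholds=THRESHOLDS):
    """Build one record-to-DI mapping per quality level.

    `scored_pairs` is an iterable of (record_id, di_id, match_probability).
    Each record keeps only its best-scoring DI candidate.
    """
    best = {}
    for record_id, di_id, prob in scored_pairs:
        if record_id not in best or prob > best[record_id][1]:
            best[record_id] = (di_id, prob)
    return {
        level: {rid: di for rid, (di, p) in best.items() if p >= cutoff}
        for level, cutoff in thresholds.items()
    }


def join_on_spine(links_a, links_b):
    """Pair records from two datasets that link to the same DI artificial ID."""
    by_di = {}
    for rid, di in links_b.items():
        by_di.setdefault(di, []).append(rid)
    return [
        (rid_a, rid_b, di)
        for rid_a, di in links_a.items()
        for rid_b in by_di.get(di, [])
    ]
```

Because every dataset is linked to the DI rather than to each other, two datasets never need to be compared directly: their records are brought together wherever they share a DI ID.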
Initially, the method will be used on the 2021 England and Wales Census. Although the method does not include clerical matching (except for quality assurance), we anticipate high precision and recall, owing to the quality of the Census and the number of linkage variables available. Thereafter, we plan user testing with other datasets, including the Labour Market Survey.
Conclusion
Our generalisable linkage pipeline for the DI will, through its IDA outputs, facilitate research for the public good. This research will directly impact government policy and responses to national health emergencies, including Covid-19, and support government priorities such as Levelling Up and the transition towards Net Zero.