Analee J Etheredge, Samuel Hosmer, Aldo Crossa, Rachel Suss, Mark Torrey
{"title":"食品店的名字里有什么?利用大型语言模型增强食品环境数据。","authors":"Analee J Etheredge, Samuel Hosmer, Aldo Crossa, Rachel Suss, Mark Torrey","doi":"10.3389/frai.2024.1476950","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>It is not uncommon to repurpose administrative food data to create food environment datasets in the health department and research settings; however, the available administrative data are rarely categorized in a way that supports meaningful insight or action, and ground-truthing or manually reviewing an entire city or neighborhood is rate-limiting to essential operations and analysis. We show that such categorizations should be viewed as a classification problem well addressed by recent advances in natural language processing and deep learning-with the advent of large language models (LLMs).</p><p><strong>Methods: </strong>To demonstrate how to automate the process of categorizing food stores, we use the foundation model BERT to give a first approximation to such categorizations: a best guess by store name. First, 10 food retail classes were developed to comprehensively categorize food store types from a public health perspective.</p><p><strong>Results: </strong>Based on this rubric, the model was tuned and evaluated (F1<sub>micro</sub> = 0.710, F1<sub>macro</sub> = 0.709) on an extensive storefront directory of New York City. Second, the model was applied to infer insights from a large, unlabeled dataset using store names alone, aiming to replicate known temporospatial patterns. Finally, a complimentary application of the model as a data quality enhancement tool was demonstrated on a secondary, pre-labeled restaurant dataset.</p><p><strong>Discussion: </strong>This novel application of an LLM to the enumeration of the food environment allowed for marked gains in efficiency compared to manual, in-person methods, addressing a known challenge to research and operations in a local health department.</p>","PeriodicalId":33315,"journal":{"name":"Frontiers in Artificial Intelligence","volume":"7 ","pages":"1476950"},"PeriodicalIF":3.0000,"publicationDate":"2024-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11660183/pdf/","citationCount":"0","resultStr":"{\"title\":\"What is in a food store name? Leveraging large language models to enhance food environment data.\",\"authors\":\"Analee J Etheredge, Samuel Hosmer, Aldo Crossa, Rachel Suss, Mark Torrey\",\"doi\":\"10.3389/frai.2024.1476950\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Introduction: </strong>It is not uncommon to repurpose administrative food data to create food environment datasets in the health department and research settings; however, the available administrative data are rarely categorized in a way that supports meaningful insight or action, and ground-truthing or manually reviewing an entire city or neighborhood is rate-limiting to essential operations and analysis. We show that such categorizations should be viewed as a classification problem well addressed by recent advances in natural language processing and deep learning-with the advent of large language models (LLMs).</p><p><strong>Methods: </strong>To demonstrate how to automate the process of categorizing food stores, we use the foundation model BERT to give a first approximation to such categorizations: a best guess by store name. First, 10 food retail classes were developed to comprehensively categorize food store types from a public health perspective.</p><p><strong>Results: </strong>Based on this rubric, the model was tuned and evaluated (F1<sub>micro</sub> = 0.710, F1<sub>macro</sub> = 0.709) on an extensive storefront directory of New York City. Second, the model was applied to infer insights from a large, unlabeled dataset using store names alone, aiming to replicate known temporospatial patterns. Finally, a complimentary application of the model as a data quality enhancement tool was demonstrated on a secondary, pre-labeled restaurant dataset.</p><p><strong>Discussion: </strong>This novel application of an LLM to the enumeration of the food environment allowed for marked gains in efficiency compared to manual, in-person methods, addressing a known challenge to research and operations in a local health department.</p>\",\"PeriodicalId\":33315,\"journal\":{\"name\":\"Frontiers in Artificial Intelligence\",\"volume\":\"7 \",\"pages\":\"1476950\"},\"PeriodicalIF\":3.0000,\"publicationDate\":\"2024-12-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11660183/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Frontiers in Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3389/frai.2024.1476950\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/frai.2024.1476950","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
What is in a food store name? Leveraging large language models to enhance food environment data.
Introduction: It is not uncommon to repurpose administrative food data to create food environment datasets in the health department and research settings; however, the available administrative data are rarely categorized in a way that supports meaningful insight or action, and ground-truthing or manually reviewing an entire city or neighborhood is rate-limiting to essential operations and analysis. We show that such categorizations should be viewed as a classification problem well addressed by recent advances in natural language processing and deep learning-with the advent of large language models (LLMs).
Methods: To demonstrate how to automate the process of categorizing food stores, we use the foundation model BERT to give a first approximation to such categorizations: a best guess by store name. First, 10 food retail classes were developed to comprehensively categorize food store types from a public health perspective.
Results: Based on this rubric, the model was tuned and evaluated (F1micro = 0.710, F1macro = 0.709) on an extensive storefront directory of New York City. Second, the model was applied to infer insights from a large, unlabeled dataset using store names alone, aiming to replicate known temporospatial patterns. Finally, a complimentary application of the model as a data quality enhancement tool was demonstrated on a secondary, pre-labeled restaurant dataset.
Discussion: This novel application of an LLM to the enumeration of the food environment allowed for marked gains in efficiency compared to manual, in-person methods, addressing a known challenge to research and operations in a local health department.