
Frontiers in Big Data: Latest Publications

Sparse and Expandable Network for Google's Pathways.
IF 2.4 | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-08-29 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1348030
Charles X Ling, Ganyu Wang, Boyu Wang

Introduction: Recently, Google introduced Pathways as its next-generation AI architecture. Pathways must address three critical challenges: learning one general model for several continuous tasks, ensuring tasks can leverage each other without forgetting old tasks, and learning from multi-modal data such as images and audio. Additionally, Pathways must maintain sparsity in both learning and deployment. Current lifelong multi-task learning approaches are inadequate in addressing these challenges.

Methods: To address these challenges, we propose SEN, a Sparse and Expandable Network. SEN is designed to handle multiple tasks concurrently by maintaining sparsity and enabling expansion when new tasks are introduced. The network leverages multi-modal data, integrating information from different sources while preventing interference between tasks.

Results: The proposed SEN model demonstrates significant improvements in multi-task learning, successfully managing task interference and forgetting. It effectively integrates data from various modalities and maintains efficiency through sparsity during both the learning and deployment phases.

Discussion: SEN offers a straightforward yet effective solution to the limitations of current lifelong multi-task learning methods. By addressing the challenges identified in the Pathways architecture, SEN provides a promising approach for developing AI systems capable of learning and adapting over time without sacrificing performance or efficiency.
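To make the sparse-and-expandable idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation): a frozen shared trunk provides common features, and each new task gets its own randomly sparsified head, so weights for earlier tasks are never overwritten. The class name, sparsity level, and architecture are all illustrative assumptions.

```python
import numpy as np

class ExpandableNet:
    """Toy sparse, expandable network: a frozen shared trunk plus one
    sparse linear head per task, added on demand (illustrative only)."""

    def __init__(self, in_dim, hidden_dim, sparsity=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.trunk = rng.standard_normal((in_dim, hidden_dim)) / np.sqrt(in_dim)
        self.sparsity = sparsity
        self.heads = {}          # task name -> sparse head weights
        self.rng = rng

    def add_task(self, name, out_dim):
        # New tasks get their own sparse head; existing heads are untouched,
        # so earlier tasks are not forgotten when the network expands.
        h = self.trunk.shape[1]
        w = self.rng.standard_normal((h, out_dim)) / np.sqrt(h)
        mask = self.rng.random((h, out_dim)) > self.sparsity  # keep ~10%
        self.heads[name] = mask * w

    def forward(self, x, task):
        hidden = np.maximum(x @ self.trunk, 0.0)   # shared ReLU features
        return hidden @ self.heads[task]
```

Adding a task only appends parameters, which is one simple way to reconcile expansion with non-forgetting; the sparsity mask keeps both learning and deployment cheap.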

Citations: 0
Efficient use of binned data for imputing univariate time series data.
IF 2.4 | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-08-21 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1422650
Jay Darji, Nupur Biswas, Vijay Padul, Jaya Gill, Santosh Kesari, Shashaanka Ashili

Time series data are recorded across many sectors, producing large volumes of data. However, the continuity of these data is often interrupted, leaving periods of missing values. Several algorithms are used to impute the missing data, and their performance varies widely. Beyond the choice of algorithm, effective imputation depends on the nature of the missing and available data. We conducted extensive studies using different types of time series data, specifically heart rate data and power consumption data. We generated missing data over different time spans and imputed them using different algorithms applied to binned data of different sizes. Performance was evaluated using the root mean square error (RMSE) metric. We observed a reduction in RMSE when using binned data compared to the entire dataset, particularly for the expectation-maximization (EM) algorithm. RMSE was reduced when using binned data for 1-, 5-, and 15-min missing spans, with the greatest reduction observed for 15-min spans. We also examined the effect of data fluctuation. We conclude that the usefulness of binned data depends on the span of the missing data, the sampling frequency, and the fluctuation within the data. Depending on the inherent characteristics, quality, and quantity of the missing and available data, binned data can be used to impute a wide variety of series, including biological heart rate data from Internet of Things (IoT) smartwatch devices and non-biological data such as household power consumption.
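As a toy illustration of why a local bin can beat the whole series for mean-based imputation on trending data (the synthetic series, gap position, and bin size here are invented, not the paper's protocol):

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Synthetic "heart-rate-like" series with a slow oscillating trend
# (illustrative data, not the paper's datasets).
rng = np.random.default_rng(1)
t = np.arange(1000)
series = 70 + 10 * np.sin(2 * np.pi * t / 300) + rng.normal(0, 1, t.size)

gap = slice(400, 415)                 # a 15-sample missing span
truth = series[gap].copy()
observed = series.copy()
observed[gap] = np.nan

# Whole-series imputation: fill with the global mean of observed points.
global_fill = np.nanmean(observed)

# Binned imputation: fill with the mean of a small bin around the gap.
bin_ = np.concatenate([observed[350:400], observed[415:465]])
local_fill = np.nanmean(bin_)

rmse_global = rmse(truth, np.full(truth.shape, global_fill))
rmse_local = rmse(truth, np.full(truth.shape, local_fill))
```

Because the local bin tracks the trend near the gap while the global mean averages it away, the binned fill lands much closer to the truth; this mirrors the abstract's observation that bin usefulness depends on the gap span and the fluctuation of the data.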

Citations: 0
Equitable differential privacy.
IF 2.4 | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-08-16 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1420344
Vasundhara Kaul, Tamalika Mukherjee

Differential privacy (DP) has been in the public spotlight since the announcement of its use in the 2020 U.S. Census. While DP algorithms have substantially improved the confidentiality protections provided to Census respondents, concerns have been raised about the accuracy of the DP-protected Census data. The extent to which the use of DP distorts the ability to draw inferences that drive policy about small populations, especially marginalized communities, has been of particular concern to researchers and policy makers. After all, inaccurate information about marginalized populations can often engender policies that exacerbate rather than ameliorate social inequities. Consequently, computer science experts have focused on developing mechanisms that help achieve equitable privacy, i.e., mechanisms that mitigate the data distortions introduced by privacy protections to ensure equitable outcomes and benefits for all groups, particularly marginalized groups. Our paper extends the conversation on equitable privacy by highlighting the importance of inclusive communication in ensuring equitable outcomes for all social groups through all the stages of deploying a differentially private system. We conceptualize Equitable DP as the design, communication, and implementation of DP algorithms that ensure equitable outcomes. Thus, in addition to adopting computer scientists' recommendations of incorporating equity parameters within DP algorithms, we suggest that it is critical for an organization to also facilitate inclusive communication throughout the design, development, and implementation stages of a DP algorithm to ensure it has an equitable impact on social groups and does not hinder the redressal of social inequities. To demonstrate the importance of communication for Equitable DP, we undertake a case study of the process through which DP was adopted as the newest disclosure avoidance system for the 2020 U.S. Census.
Drawing on the Inclusive Science Communication (ISC) framework, we examine the extent to which the Census Bureau's communication strategies encouraged engagement across the diverse groups of users that employ the decennial Census data for research and policy making. Our analysis provides lessons that can be used by other government organizations interested in incorporating the Equitable DP approach in their data collection practices.
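The equity concern has a simple numerical core: the Laplace mechanism adds noise of the same scale to every count, so the relative distortion is far larger for small groups. A hedged sketch of that effect (the epsilon value and group sizes are arbitrary choices for illustration, not Census parameters, and this is the textbook Laplace mechanism, not the Census TopDown algorithm):

```python
import numpy as np

def laplace_count(true_count, epsilon, rng):
    """Release a count under epsilon-DP via the Laplace mechanism
    (sensitivity 1 for a counting query)."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(42)
epsilon = 0.5
trials = 20000

# Same noise scale, very different relative impact on large vs. small groups.
big, small = 10_000, 20
err_big = np.mean([abs(laplace_count(big, epsilon, rng) - big) / big
                   for _ in range(trials)])
err_small = np.mean([abs(laplace_count(small, epsilon, rng) - small) / small
                     for _ in range(trials)])
```

The expected absolute noise is 1/epsilon = 2 in both cases, so the relative error for the 20-person group is roughly 500 times that for the 10,000-person group; this is the mechanism behind the "small populations" concern the abstract raises.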

Citations: 0
Data science's cultural construction: qualitative ideas for quantitative work.
IF 2.4 | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-08-14 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1287442
Philipp Brandt

Introduction: "Data scientists" quickly became ubiquitous, often infamously so, but they have struggled with the ambiguity of their novel role. This article studies data science's collective definition on Twitter.

Methods: The analysis responds to the challenges of studying an emergent case with unclear boundaries and substance through a cultural perspective and complementary datasets ranging from 1,025 to 752,815 tweets. It brings together relations between accounts that tweeted about data science, the hashtags they used, indicating purposes, and the topics they discussed.

Results: The first results reproduce familiar commercial and technical motives. Additional results reveal concerns with new practical and ethical standards as a distinctive motive for constructing data science.

Discussion: The article provides a sensibility for local meaning in usually abstract datasets and a heuristic for navigating increasingly abundant datasets toward surprising insights. For data scientists, it offers a guide for positioning themselves vis-à-vis others to navigate their professional future.

Citations: 0
The development and application of a novel E-commerce recommendation system used in electric power B2B sector.
IF 2.4 | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-31 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1374980
Wenjun Meng, Lili Chen, Zhaomin Dong

The advent of the digital era has transformed E-commerce platforms into critical tools for industry, yet traditional recommendation systems often fall short in the specialized context of the electric power industry. These systems typically struggle with the industry's unique challenges, such as infrequent and high-stakes transactions, prolonged decision-making processes, and sparse data. This research has developed a novel recommendation engine tailored to these specific conditions, designed in particular to handle the low-frequency, long-cycle nature of Business-to-Business (B2B) transactions. The approach includes algorithmic enhancements to better process and interpret the limited available data, and data pre-processing techniques designed to enrich the sparse datasets characteristic of this industry. This research also introduces a methodological innovation that integrates multi-dimensional data, combining user E-commerce activities, product specifics, and essential non-tendering information. The proposed engine employs advanced machine learning techniques to provide more accurate and relevant recommendations. The results demonstrate a marked improvement over traditional models, offering a more robust and effective tool for facilitating B2B transactions in the electric power industry. This research not only addresses the sector's unique challenges but also provides a blueprint for adapting recommendation systems to other industries with similar B2B characteristics.
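The abstract does not disclose the engine's internals, but the sparse-interaction setting it describes can be illustrated with a minimal item-based collaborative-filtering sketch; the interaction matrix, function names, and the cosine-similarity choice below are all hypothetical stand-ins, not the paper's model:

```python
import numpy as np

# Toy user-item interaction matrix for a B2B catalog: mostly zeros,
# reflecting the infrequent transactions the paper highlights (invented data).
interactions = np.array([
    [5, 0, 3, 0, 0],
    [4, 0, 0, 2, 1],
    [0, 3, 0, 5, 0],
    [5, 0, 4, 0, 0],
], dtype=float)

def item_similarity(m):
    """Cosine similarity between item columns, guarding all-zero norms."""
    norms = np.linalg.norm(m, axis=0)
    norms[norms == 0] = 1.0
    unit = m / norms
    return unit.T @ unit

def recommend(m, user, top_n=2):
    """Score unseen items by similarity to the user's consumed items."""
    sim = item_similarity(m)
    scores = sim @ m[user]             # similarity-weighted aggregation
    scores[m[user] > 0] = -np.inf      # exclude already-consumed items
    return np.argsort(scores)[::-1][:top_n]
```

With data this sparse, plain similarity scoring degrades quickly, which is why the paper turns to enriched pre-processing and multi-dimensional features rather than interactions alone.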

Citations: 0
Efficient enhancement of low-rank tensor completion via thin QR decomposition.
IF 2.4 | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-02 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1382144
Yan Wu, Yunzhi Jin

Low-rank tensor completion (LRTC), which aims to fill in the missing entries of partially observed tensors by exploiting their low-rank structure, has been widely used in various real-world problems. The core tensor nuclear norm minimization (CTNM) method based on Tucker decomposition is one of the most common LRTC approaches. However, CTNM methods based on Tucker decomposition often have a high computing cost because the general factor-matrix solving technique performs multiple singular value decompositions (SVDs) in each loop. To address this problem, this article enhances the method and proposes an effective CTNM method based on thin QR decomposition (CTNM-QR) with lower computing complexity. The proposed method extends CTNM by introducing tensor versions of the auxiliary variables instead of matrices, and uses thin QR decomposition rather than the SVD to solve for the factor matrices, which reduces the computational complexity and improves tensor completion accuracy. In addition, the convergence and complexity of the CTNM-QR method are analyzed. Numerous experiments on synthetic data, real color images, and brain MRI data at different missing rates demonstrate that the proposed method not only achieves better completion accuracy and visual quality but also runs more efficiently than most state-of-the-art LRTC methods.
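The core computational trade-off can be seen directly in NumPy: a thin (reduced) QR factorization yields an orthonormal factor spanning the same column space as the left singular vectors, without computing any singular values. This is a generic illustration of substituting QR for SVD on a matrix unfolding, not the CTNM-QR algorithm itself:

```python
import numpy as np

rng = np.random.default_rng(0)
# A tall matrix standing in for one unfolding of a partially observed tensor
# (illustrative; real unfoldings come from the tensor being completed).
A = rng.standard_normal((500, 40))

# Thin QR: Q is 500 x 40 with orthonormal columns; no singular values needed.
Q, R = np.linalg.qr(A, mode="reduced")

# Full-column SVD of the same matrix, for comparison.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Both Q and U span col(A), so the projectors onto that subspace agree,
# even though QR is substantially cheaper than SVD for tall matrices.
proj_qr = Q @ Q.T
proj_svd = U @ U.T
```

When only an orthonormal basis of the factor subspace is needed per iteration, this is exactly the saving the paper exploits: the thin QR costs O(mn^2) with a much smaller constant than the SVD and avoids the iterative singular-value computation entirely.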

Citations: 0
Random kernel k-nearest neighbors regression.
IF 2.4 | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-01 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1402384
Patchanok Srisuradetchai, Korn Suksrikran

The k-nearest neighbors (KNN) regression method, known for its nonparametric nature, is highly valued for its simplicity and its effectiveness in handling complex structured data, particularly in big data contexts. However, the method is susceptible to overfitting and fit discontinuity, which present significant challenges. This paper introduces random kernel k-nearest neighbors (RK-KNN) regression as a novel approach that is well suited to big data applications. It integrates kernel smoothing with bootstrap sampling to enhance prediction accuracy and the robustness of the model. The method aggregates multiple predictions using random sampling from the training dataset and selects subsets of input variables for kernel KNN (K-KNN). A comprehensive evaluation of RK-KNN on 15 diverse datasets, employing various kernel functions including Gaussian and Epanechnikov, demonstrates its superior performance. Compared to standard KNN and random KNN (R-KNN) models, it significantly reduces the root mean square error (RMSE) and mean absolute error and improves R-squared values. The RK-KNN variant that employs the kernel function yielding the lowest RMSE is benchmarked against state-of-the-art methods, including support vector regression, artificial neural networks, and random forests.
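A compact sketch of the RK-KNN recipe as the abstract describes it: Gaussian-kernel-weighted KNN base learners, each trained on a bootstrap sample and a random feature subset, with predictions averaged. The ensemble size, bandwidth, subset size, and exact kernel weighting here are guesses for illustration, not the paper's settings:

```python
import numpy as np

def kernel_knn_predict(X_tr, y_tr, X_te, k=5, bandwidth=1.0):
    """Gaussian-kernel-weighted k-NN regression for one feature subset."""
    preds = np.empty(len(X_te))
    for i, x in enumerate(X_te):
        d = np.linalg.norm(X_tr - x, axis=1)
        idx = np.argsort(d)[:k]
        w = np.exp(-(d[idx] / bandwidth) ** 2)   # Gaussian kernel weights
        preds[i] = np.sum(w * y_tr[idx]) / np.sum(w)
    return preds

def rk_knn_predict(X_tr, y_tr, X_te, n_models=25, k=5, seed=0):
    """Sketch of RK-KNN: average kernel-KNN predictions over bootstrap
    samples and random feature subsets (hypothetical hyperparameters)."""
    rng = np.random.default_rng(seed)
    n, p = X_tr.shape
    all_preds = []
    for _ in range(n_models):
        rows = rng.integers(0, n, size=n)                   # bootstrap sample
        cols = rng.choice(p, size=max(1, p // 2), replace=False)
        all_preds.append(kernel_knn_predict(
            X_tr[np.ix_(rows, cols)], y_tr[rows], X_te[:, cols], k=k))
    return np.mean(all_preds, axis=0)
```

The kernel weights smooth the discontinuities of plain KNN, while bootstrap and feature randomization decorrelate the base learners so averaging reduces overfitting, mirroring the two weaknesses the abstract identifies.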

Citations: 0
Global explanation supervision for Graph Neural Networks.
IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2024-07-01 eCollection Date: 2024-01-01 DOI: 10.3389/fdata.2024.1410424
Negar Etemadyrad, Yuyang Gao, Sai Manoj Pudukotai Dinakarrao, Liang Zhao

With the increasing popularity of Graph Neural Networks (GNNs) for predictive tasks on graph-structured data, research on their explainability is becoming more critical and achieving significant progress. Although many methods have been proposed to explain the predictions of GNNs, their focus is mainly on "how to generate explanations." However, other important research questions, such as "whether the GNN explanations are inaccurate," "what if the explanations are inaccurate," and "how to adjust the model to generate more accurate explanations," have received little attention. Our previous GNN Explanation Supervision (GNES) framework demonstrated effectiveness in improving the reasonability of local explanations while maintaining, or even improving, the performance of the backbone GNN model. In many applications, we need global explanations that are reasonable and faithful to the domain data, rather than per-sample explanations. Simply learning to explain GNNs locally is not an optimal route to a global understanding of the model. To improve the explanatory power of the GNES framework, we propose the Global GNN Explanation Supervision (GGNES) technique, which uses a basic trained GNN and a global extension of the loss function used in the GNES framework. This GNN produces local explanations, which are fed to a Global Logic-based GNN Explainer, an existing technique that can learn a global explanation in the form of a logic formula. The two frameworks are then trained iteratively to generate reasonable global explanations. Extensive experiments demonstrate the effectiveness of the proposed model in improving global explanations while keeping performance similar or even increasing the model's predictive power.
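The core idea of explanation supervision — augmenting the task loss with a penalty that pulls the model's importance scores toward reference explanations — can be illustrated with a toy composite objective. This is a deliberately simplified sketch: the function name, the squared-error form of both terms, and the single trade-off weight `lam` are assumptions, and the actual GNES/GGNES losses differ in detail.

```python
import numpy as np

def explanation_supervision_loss(y_pred, y_true, expl_pred, expl_true, lam=0.5):
    """Toy composite objective: prediction error plus a penalty that aligns
    the model's node/edge importance scores with reference explanations.
    (Illustrative only; the papers' actual loss terms differ.)"""
    task = np.mean((y_pred - y_true) ** 2)        # task (prediction) loss
    expl = np.mean((expl_pred - expl_true) ** 2)  # explanation mismatch
    return task + lam * expl
```

Minimizing such an objective trades prediction accuracy against explanation fidelity, which matches the abstract's claim that supervision can improve explanations while keeping (or improving) backbone performance.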

Citations: 0
YOLOv8's advancements in tuberculosis identification from chest images.
IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2024-06-27 eCollection Date: 2024-01-01 DOI: 10.3389/fdata.2024.1401981
Mohamudha Parveen Rahamathulla, W R Sam Emmanuel, A Bindhu, Mohamed Mustaq Ahmed

Tuberculosis (TB) is a chronic and pathogenic disease that can lead to life-threatening outcomes, including death. Many people have been affected by TB owing to inaccurate and late diagnosis and deficient treatment. Early detection of TB is important to protect people from the severity of the disease and its threatening consequences. Traditionally, manual methods such as chest X-rays and CT scans have been used for TB prediction. Nevertheless, these approaches are time-consuming and ineffective for achieving optimal results. To resolve this problem, several researchers have focused on automated TB prediction; however, existing efforts still suffer from limited accuracy, data overfitting, and slow speed. To improve TB prediction, the proposed research employs the Selection Focal Fusion (SFF) block in the You Only Look Once v8 (YOLOv8, Ultralytics, Los Angeles, United States) object detection model with an attention mechanism, using the Kaggle TBX-11k dataset. YOLOv8 is used for its ability to detect multiple objects in a single pass. However, it struggles with small objects and cannot perform fine-grained classification. To address this problem, the proposed research incorporates the SFF technique to improve detection performance and decrease the missed-detection rate for small objects. Correspondingly, the efficacy of the proposed mechanism is assessed using various performance metrics, such as recall, precision, F1-score, and mean Average Precision (mAP), to estimate the performance of the proposed framework. Furthermore, comparison with existing models reveals the efficiency of the proposed research. The present research is envisioned to contribute to the medical field and assist radiologists in identifying tuberculosis using the YOLOv8 model to obtain an optimal outcome.
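The evaluation metrics the abstract names are built from two standard ingredients: an intersection-over-union (IoU) overlap test that decides whether a detection counts as a true positive, and precision/recall/F1 computed from the resulting counts (mAP additionally averages precision over recall and IoU thresholds). A minimal sketch of both pieces, with hypothetical function names:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes — the overlap
    criterion used to match detections to ground-truth annotations."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1-score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A detection is typically counted as a true positive when its IoU with a ground-truth box exceeds a threshold (0.5 is a common choice); the counts then feed `detection_metrics`.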

Citations: 0
MedT5SQL: a transformers-based large language model for text-to-SQL conversion in the healthcare domain.
IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2024-06-26 eCollection Date: 2024-01-01 DOI: 10.3389/fdata.2024.1371680
Alaa Marshan, Anwar Nais Almutairi, Athina Ioannou, David Bell, Asmat Monaghan, Mahir Arzoky

Introduction: In response to the increasing prevalence of electronic medical records (EMRs) stored in databases, healthcare staff are encountering difficulties retrieving these records due to their limited technical expertise in database operations. As these records are crucial for delivering appropriate medical care, there is a need for an accessible method for healthcare staff to access EMRs.

Methods: To address this, natural language processing (NLP) for Text-to-SQL has emerged as a solution, enabling non-technical users to generate SQL queries using natural language text. This research assesses existing work on Text-to-SQL conversion and proposes the MedT5SQL model specifically designed for EMR retrieval. The proposed model utilizes the Text-to-Text Transfer Transformer (T5) model, a Large Language Model (LLM) commonly used in various text-based NLP tasks. The model is fine-tuned on the MIMICSQL dataset, the first Text-to-SQL dataset for the healthcare domain. Performance evaluation involves benchmarking the MedT5SQL model on two optimizers, varying numbers of training epochs, and using two datasets, MIMICSQL and WikiSQL.

Results: For the MIMICSQL dataset, the model demonstrates considerable effectiveness in generating question-SQL pairs, achieving accuracies of 80.63%, 98.937%, and 90% for exact-match accuracy, approximate string-matching, and manual evaluation, respectively. When testing the model on the WikiSQL dataset, it generates SQL queries efficiently, with an accuracy of 44.2% on WikiSQL and 94.26% for approximate string-matching.

Discussion: Results indicate improved performance with increased training epochs. This work highlights the potential of a fine-tuned T5 model to convert medical questions written in natural language into Structured Query Language (SQL) in the healthcare domain, providing a foundation for future research in this area.
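The two automatic metrics reported above — exact-match accuracy and approximate string-matching — can be sketched with the standard library. This is an assumed reconstruction: the paper's exact normalization and similarity definition may differ; here exact match is case-insensitive after whitespace stripping, and the approximate score is `difflib`'s character-level similarity ratio.

```python
from difflib import SequenceMatcher

def exact_match_accuracy(preds, golds):
    """Fraction of generated SQL strings identical to the reference
    (case-insensitive, whitespace-stripped)."""
    return sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(preds, golds)) / len(golds)

def approx_match_score(preds, golds):
    """Mean character-level similarity ratio between generated and
    reference SQL (a stand-in for approximate string-matching)."""
    return sum(SequenceMatcher(None, p.lower(), g.lower()).ratio()
               for p, g in zip(preds, golds)) / len(golds)
```

The gap between the two metrics (44.2% vs. 94.26% on WikiSQL) is typical: a query can be nearly correct character-by-character yet fail the all-or-nothing exact-match test.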

Citations: 0