
Proceedings : ... IEEE International Conference on Big Data — Latest Publications

Zero-shot Learning with Minimum Instruction to Extract Social Determinants and Family History from Clinical Notes using GPT Model.
Neel Jitesh Bhate, Ansh Mittal, Zhe He, Xiao Luo

Demographics, social determinants of health, and family history documented in the unstructured text of electronic health records are increasingly being studied to understand how this information can be combined with structured data to improve healthcare outcomes. Since the release of the GPT models, many studies have applied them to extract this information from narrative clinical notes. Unlike existing work, our research investigates zero-shot learning for extracting all of this information jointly while providing only minimal instruction to the GPT model. We utilize de-identified real-world clinical notes annotated for demographics, various social determinants, and family history information. Given that the GPT model may return text that differs from the text in the original data, we explore two sets of evaluation metrics, traditional NER evaluation metrics and semantic similarity metrics, to fully characterize performance. Our results show that the GPT-3.5 method achieved an average F1 of 0.975 on demographics extraction, 0.615 on social determinants extraction, and 0.722 on family history extraction. We believe these results can be further improved through model fine-tuning or few-shot learning. Through case studies, we also identified limitations of the GPT models that need to be addressed in future research.
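The traditional NER evaluation the abstract mentions can be illustrated with a small sketch; the entity categories and example spans below are hypothetical, not drawn from the paper's data:

```python
# Hypothetical example of exact-match NER evaluation: entities are
# (category, text) pairs; F1 is the harmonic mean of precision and recall.
def ner_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative annotations, not from the paper's dataset.
gold = {("demographics", "45-year-old"), ("social_determinant", "lives alone"),
        ("family_history", "mother had diabetes")}
pred = {("demographics", "45-year-old"), ("family_history", "mother had diabetes")}
score = ner_f1(pred, gold)  # precision 1.0, recall 2/3, F1 0.8
```

Semantic-similarity evaluation would replace the exact-match test with an embedding comparison, which is why the paper reports both.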

DOI: 10.1109/BigData59044.2023.10386811 · Volume 2023, pp. 1476-1480 · Published 2023-12-01 · Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11295958/pdf/
Citations: 0
Private Continuous Survival Analysis with Distributed Multi-Site Data.
Luca Bonomi, Marilyn Lionts, Liyue Fan

Effective disease surveillance systems require large-scale epidemiological data to improve health outcomes and quality of care for the general population. As data may be limited within a single site, multi-site data (e.g., from a number of local/regional health systems) need to be considered. Leveraging distributed data across multiple sites for epidemiological analysis poses significant challenges. Due to the sensitive nature of epidemiological data, it is imperative to design distributed solutions that provide strong privacy protections. Current privacy solutions often assume a central site that is responsible for aggregating the distributed data and applying privacy protection before sharing the results (e.g., aggregation via secure primitives and differential privacy for sharing aggregate results). However, identifying such a central site may be difficult in practice, and relying on one may introduce potential vulnerabilities (e.g., a single point of failure). Furthermore, to support clinical interventions and inform policy decisions in a timely manner, epidemiological analyses need to reflect dynamic changes in the data. Yet, existing distributed privacy-protecting approaches were largely designed for static data (e.g., one-time data sharing) and cannot fulfill dynamic data requirements. In this work, we propose a privacy-protecting approach that supports the sharing of dynamic epidemiological analyses and provides strong privacy protection in a decentralized manner. We apply our solution to continuous survival analysis using the Kaplan-Meier estimation model while providing differential privacy protection. Our evaluations on a real dataset containing COVID-19 cases show that our method provides highly usable results.
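A minimal sketch of the general idea: a Kaplan-Meier estimate whose per-time event counts are perturbed with Laplace noise before release. The epsilon handling and clipping below are assumed choices for illustration, not the authors' mechanism:

```python
# Illustrative differentially private Kaplan-Meier curve (not the paper's
# implementation): noisy event counts feed the survival product.
import random

def dp_kaplan_meier(times, events, epsilon, seed=0):
    """times: event/censoring times; events: 1 = event, 0 = censored."""
    rng = random.Random(seed)
    # The difference of two iid exponentials is a Laplace(0, b) draw.
    lap = lambda b: rng.expovariate(1 / b) - rng.expovariate(1 / b)
    at_risk = len(times)
    surv, curve = 1.0, []
    for t in sorted(set(times)):
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        d_noisy = max(0.0, d + lap(1 / epsilon))  # protect the event count
        surv *= max(0.0, 1 - d_noisy / max(at_risk, 1))
        curve.append((t, surv))
        at_risk -= sum(1 for ti in times if ti == t)
    return curve

# With a very large epsilon the noise vanishes and the usual
# Kaplan-Meier estimate is recovered.
curve = dp_kaplan_meier([1, 2, 2, 3], [1, 1, 0, 1], epsilon=1e6)
```

A fully decentralized version, as the paper targets, would additionally avoid any single aggregator; this sketch only shows the noise-on-counts idea.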

DOI: 10.1109/BigData59044.2023.10386571 · Volume 2023, pp. 5444-5453 · Published 2023-12-01 · Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10997374/pdf/
Citations: 0
Doctors vs. Nurses: Understanding the Great Divide in Vaccine Hesitancy among Healthcare Workers.
Sajid Hussain Rafi Ahamed, Shahid Shakil, Hanjia Lyu, Xinping Zhang, Jiebo Luo

Healthcare workers such as doctors and nurses are expected to be trustworthy and credible sources of vaccine-related information. Their opinions toward the COVID-19 vaccines may influence vaccine uptake among the general population. However, vaccine hesitancy remains an important issue even among healthcare workers. Therefore, it is critical to understand their opinions to help reduce the level of vaccine hesitancy. Studies have examined healthcare workers' viewpoints on COVID-19 vaccines using questionnaires; reportedly, a considerably higher proportion of vaccine hesitancy is observed among nurses than among doctors. We intend to verify and study this phenomenon at a much larger scale and at a finer granularity using social media data, which researchers have leveraged effectively and efficiently to address real-world issues during the COVID-19 pandemic. More specifically, we use a keyword search to identify healthcare workers and further classify them into doctors and nurses based on the profile descriptions of the corresponding Twitter users. Moreover, we apply a transformer-based language model to remove irrelevant tweets. Sentiment analysis and topic modeling are employed to analyze and compare the sentiment and thematic differences in the tweets posted by doctors and nurses. We find that doctors are overall more positive toward the COVID-19 vaccines. The focuses of doctors and nurses when they discuss vaccines negatively are in general different: doctors are more concerned with the effectiveness of the vaccines against newer variants, while nurses pay more attention to potential side effects on children. Therefore, we suggest that more customized strategies be deployed when communicating with different groups of healthcare workers.
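The keyword-based profile classification step can be sketched as follows; the keyword lists are assumptions for illustration, not the authors' lexicon:

```python
# Illustrative keyword lists (assumed, not the paper's actual lexicon).
DOCTOR_KW = {"physician", "doctor", "surgeon", "md"}
NURSE_KW = {"nurse", "nursing", "rn"}

def classify_profile(bio):
    """Classify a Twitter bio as doctor / nurse / other by keyword match."""
    tokens = {w.strip(".,!#@") for w in bio.lower().split()}
    if tokens & DOCTOR_KW:
        return "doctor"
    if tokens & NURSE_KW:
        return "nurse"
    return "other"

labels = [classify_profile(b) for b in
          ["ER physician, dad of two", "ICU nurse and coffee addict",
           "data engineer"]]
```

In the paper this step is followed by a transformer-based relevance filter before sentiment analysis and topic modeling.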

DOI: 10.1109/bigdata55660.2022.10020853 · Volume 2022, pp. 5865-5870 · Published 2022-12-01 · Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10208360/pdf/nihms-1891951.pdf
Citations: 5
Multi-Query Optimization Revisited: A Full-Query Algebraic Method.
Yicheng Tu, Mehrad Eslami, Zichen Xu, Hadi Charkhgard

Sharing data and computation among concurrent queries has been an active research topic in database systems. While work in this area has developed algorithms and systems that are shown to be effective, it lacks a logical foundation for query processing and optimization. In this paper, we present PsiDB, a system model for processing a large number of database queries in a batch. The key idea is to generate a single query expression that returns a global relation containing all the data needed for the individual queries. For that, we propose the use of a type of relational operator, called the ψ-operator, to combine the individual queries into the global expression. We tackle the algebraic optimization problem in PsiDB by developing equivalence rules to transform concurrent queries with the purpose of revealing query optimization opportunities. Centering around the ψ-operator, our rules not only cover many optimization techniques adopted in existing batch processing systems but also reveal new optimization opportunities. Experiments conducted on an early prototype of PsiDB show a performance improvement of up to 36X over a mainstream commercial DBMS.
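The core idea of evaluating one global expression and routing rows to the individual queries can be sketched in a few lines; this in-memory toy glosses over PsiDB's actual relational algebra and the formal definition of the ψ-operator:

```python
# Toy illustration: each query contributes a predicate; a single shared
# scan over the relation serves all queries at once, loosely mimicking
# the single global expression PsiDB builds.
def batch_scan(rows, predicates):
    results = {qid: [] for qid in predicates}
    for row in rows:  # one pass over the data for the whole batch
        for qid, pred in predicates.items():
            if pred(row):
                results[qid].append(row)
    return results

rows = [{"age": 25, "dept": "A"},
        {"age": 40, "dept": "B"},
        {"age": 35, "dept": "A"}]
queries = {"q1": lambda r: r["age"] > 30,     # SELECT * WHERE age > 30
           "q2": lambda r: r["dept"] == "A"}  # SELECT * WHERE dept = 'A'
out = batch_scan(rows, queries)
```

The benefit in a real system comes from sharing I/O and operator work across the batch rather than scanning once per query.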

DOI: 10.1109/bigdata55660.2022.10020338 · Volume 2022, pp. 252-261 · Published 2022-12-01 · Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10460125/pdf/nihms-1917822.pdf
Citations: 1
Graph-guided Bayesian SVM with Adaptive Structured Shrinkage Prior for High-dimensional Data.
Wenli Sun, Changgee Chang, Qi Long

Support vector machine (SVM) is a popular classification method for the analysis of a wide range of data, including big biomedical data. Many SVM methods with feature selection have been developed under frequentist regularization or Bayesian shrinkage frameworks. On the other hand, the value of incorporating a priori known biological knowledge, such as that from functional genomics and functional proteomics, into statistical analysis of -omic data has been recognized in recent years. Such biological information is often represented by graphs. We propose a novel method that assigns Laplace priors to the regression coefficients and incorporates the underlying graph information via a hyper-prior on the shrinkage parameters of the Laplace priors. This enables smoothing of shrinkage parameters for connected variables in the graph and conditional independence between shrinkage parameters for disconnected variables. Extensive simulations demonstrate that our proposed method achieves the best performance in prediction accuracy compared with other existing SVM methods. The proposed method is also illustrated in an analysis of genomic data from cancer studies, demonstrating its advantage in generating biologically meaningful results and identifying potentially important features.
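Schematically, the hierarchical prior described above might be written as follows; the notation is assumed for illustration rather than taken from the paper:

```latex
% Laplace (double-exponential) prior on each coefficient, with its own
% shrinkage parameter:
p(\beta_j \mid \lambda_j) = \frac{\lambda_j}{2}\, e^{-\lambda_j |\beta_j|},
  \qquad j = 1, \dots, p.
% Graph-guided hyper-prior: shrinkage parameters of features connected
% in the graph G = (V, E) are smoothed toward each other:
p(\log\lambda_1, \dots, \log\lambda_p) \propto
  \exp\!\Big( -\eta \sum_{(j,k) \in E} (\log\lambda_j - \log\lambda_k)^2 \Big),
  \qquad \eta > 0.
```

Features with no edges incident to them receive conditionally independent shrinkage parameters, matching the behavior the abstract describes.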

DOI: 10.1109/bigdata52589.2021.9671712 · pp. 4472-4479 · Published 2021-12-01 · Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8855458/pdf/nihms-1776624.pdf
Citations: 0
HPCGCN: A Predictive Framework on High Performance Computing Cluster Log Data Using Graph Convolutional Networks.
Avishek Bose, Huichen Yang, William H Hsu, Daniel Andresen

This paper presents a novel use case of Graph Convolutional Network (GCN) learning representations for predictive data mining, specifically from user/task data in the domain of high-performance computing (HPC). It outlines an approach based on a coalesced data set: logs from the Slurm workload manager, joined with user-experience survey data from computational cluster users. We introduce a new method of constructing a heterogeneous unweighted HPC graph consisting of multiple typed nodes after revealing the manifold relations between the nodes. The GCN structure used here supports two tasks: i) determining whether a job will complete or fail and ii) predicting memory and CPU requirements, by training GCN semi-supervised classification and regression models on the generated graph. The graph is partitioned using graph clustering. We conducted classification and regression experiments using the proposed framework on our HPC log dataset and evaluated the predictions of our trained models against baselines using test_score, F1-score, precision, and recall for classification, and R1 score for regression, showing that our framework achieves significant improvements.
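The building block such a framework rests on is the standard graph-convolution propagation rule (Kipf and Welling); a minimal sketch, not the authors' code:

```python
# One GCN layer: ReLU( D^{-1/2} (A + I) D^{-1/2} H W ), with self-loops.
import numpy as np

def gcn_layer(A, H, W):
    """A: adjacency matrix, H: node features, W: trainable weights."""
    A_hat = A + np.eye(A.shape[0])                      # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(0.0, d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W)

A = np.array([[0.0, 1.0], [1.0, 0.0]])  # two connected nodes
H = np.eye(2)                           # one-hot node features
W = np.eye(2)                           # identity weights for clarity
Z = gcn_layer(A, H, W)                  # neighbor features mix evenly
```

Stacking such layers, with a classification head for job success/failure and a regression head for memory/CPU, gives the two-task setup the abstract describes.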

DOI: 10.1109/bigdata52589.2021.9671370 · Volume 2021, pp. 4113-4118 · Published 2021-12-01 · Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9893918/pdf/nihms-1831840.pdf
Citations: 1
HDMF: Hierarchical Data Modeling Framework for Modern Science Data Standards.
Pub Date: 2019-12-01 · Epub Date: 2020-02-24 · DOI: 10.1109/bigdata47090.2019.9005648
Andrew J Tritt, Oliver Rübel, Benjamin Dichter, Ryan Ly, Donghe Kang, Edward F Chang, Loren M Frank, Kristofer Bouchard

A ubiquitous problem in aggregating data across different experimental and observational data sources is a lack of software infrastructure that enables flexible and extensible standardization of data and metadata. To address this challenge, we developed HDMF, a hierarchical data modeling framework for modern science data standards. With HDMF, we separate the process of data standardization into three main components: (1) data modeling and specification, (2) data I/O and storage, and (3) data interaction and data APIs. To enable standards to support the complex requirements and varying use cases throughout the data life cycle, HDMF provides object mapping infrastructure to insulate and integrate these various components. This approach supports the flexible development of data standards and extensions, optimized storage backends, and data APIs, while allowing the other components of the data standards ecosystem to remain stable. To meet the demands of modern, large-scale science data, HDMF provides advanced data I/O functionality for iterative data write, lazy data load, and parallel I/O. It also supports optimization of data storage via support for chunking, compression, linking, and modular data storage. We demonstrate the application of HDMF in practice to design NWB 2.0 [13], a modern data standard for collaborative science across the neurophysiology community.
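The object-mapping idea of separating data model, specification, and storage can be sketched in a few lines; all class and attribute names below are illustrative, not HDMF's real API:

```python
# Assumed, simplified names for the three layers the abstract describes.
class Container:
    """Data-model layer: what users program against."""
    def __init__(self, name, data):
        self.name, self.data = name, data

class Spec:
    """Specification layer: declares how data must be typed/laid out."""
    def __init__(self, dtype):
        self.dtype = dtype

class ObjectMapper:
    """Mapping layer: insulates the data model from the storage backend."""
    def __init__(self, spec):
        self.spec = spec

    def build(self, container):
        # Produce a storage-neutral description any backend could persist.
        return {"name": container.name,
                "dtype": self.spec.dtype,
                "data": list(container.data)}

builder = ObjectMapper(Spec("int32")).build(Container("spike_times", [1, 2, 3]))
```

Because the mapper output is storage-neutral, a standard extension or a new backend (e.g., one with chunked, compressed storage) can be swapped in without touching the other layers, which is the stability property the abstract emphasizes.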

DOI: 10.1109/bigdata47090.2019.9005648 · pp. 165-179 · Published 2019-12-01
Citations: 4
Bayesian Non-linear Support Vector Machine for High-Dimensional Data with Incorporation of Graph Information on Features.
Pub Date : 2019-12-01 Epub Date: 2020-02-24 DOI: 10.1109/bigdata47090.2019.9006473
Wenli Sun, Changgee Chang, Qi Long

Support vector machine (SVM) is a popular classification method for the analysis of high-dimensional data such as genomics data. Recently, a number of linear SVM methods have been developed to achieve feature selection through either frequentist regularization or Bayesian shrinkage, but the linearity assumption may not be plausible for many real applications. In addition, recent work has demonstrated that incorporating known biological knowledge, such as that from functional genomics, into the statistical analysis of genomic data offers great promise for improved predictive accuracy and feature selection. Such biological knowledge can often be represented by graphs. In this article, we propose a novel knowledge-guided nonlinear Bayesian SVM approach for the analysis of high-dimensional data. Our model uses graph information that represents the relationships among the features to guide feature selection. To achieve knowledge-guided feature selection, we assign an Ising prior to the indicators representing inclusion/exclusion of the features in the model. An efficient MCMC algorithm is developed for posterior inference. In extensive simulation studies, the performance of our method is evaluated and compared, in terms of prediction and feature selection, with several penalized linear SVMs and the standard kernel SVM method. In addition, analyses of genomic data from a cancer study show that our method yields a more accurate prediction model for patient survival and reveals biologically more meaningful results than the existing methods.
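The Ising prior mentioned in the abstract can be made concrete with a small sketch. Assuming the common parameterization (indicators in {-1, +1}, a sparsity term plus a pairwise smoothness term over graph edges — the parameter names `a` and `b` here are illustrative, not taken from the paper), the unnormalized log-prior rewards configurations where neighboring features in the graph are included or excluded together:

```python
import numpy as np

def ising_log_prior(gamma, edges, a=-1.0, b=0.5):
    """Unnormalized log Ising prior on feature-inclusion indicators.

    gamma : array of +/-1 (include / exclude) per feature
    edges : list of (j, k) pairs from the feature graph
    a     : sparsity parameter (negative favors exclusion)
    b     : smoothness parameter (positive favors neighbor agreement)
    """
    singleton = a * np.sum(gamma)
    pairwise = b * sum(gamma[j] * gamma[k] for j, k in edges)
    return singleton + pairwise

# A chain graph over 4 features: neighbors agreeing is favored.
edges = [(0, 1), (1, 2), (2, 3)]
agree = np.array([1, 1, -1, -1])   # neighbors mostly agree
clash = np.array([1, -1, 1, -1])   # every neighboring pair disagrees
assert ising_log_prior(agree, edges) > ising_log_prior(clash, edges)
```

In an MCMC sampler of the kind the abstract describes, this log-prior would be combined with the SVM likelihood when proposing flips of individual indicators, so that graph-connected features tend to enter or leave the model together.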

Proceedings : ... IEEE International Conference on Big Data, pages 4874-4882.
Citations: 5
bench4gis: Benchmarking Privacy-aware Geocoding with Open Big Data.
Pub Date : 2019-12-01 Epub Date: 2020-02-24 DOI: 10.1109/bigdata47090.2019.9006234
Daniel R Harris, Chris Delcher

Geocoding, the process of translating addresses to geographic coordinates, is a relatively straightforward and well-studied process, but limitations due to privacy concerns may restrict the usage of geographic data. The impact of these limitations is further compounded by the scale of the data, which in turn also limits viable geocoding strategies. For example, healthcare data is protected by patient privacy laws in addition to possible institutional regulations that restrict external transmission and sharing of data. This results in the implementation of "in-house" geocoding solutions where data is processed behind an organization's firewall; quality assurance for these implementations is problematic because sensitive data cannot be used to externally validate results. In this paper, we present our software framework, bench4gis, which benchmarks privacy-aware geocoding solutions by leveraging open big data as surrogate data for quality assurance; the scale of open big data sets for address data can ensure that results are geographically meaningful for the locale of the implementing institution.
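The benchmarking idea — scoring an in-house geocoder against open-data surrogate coordinates without exposing sensitive addresses — can be sketched as a distance-tolerance match rate. This is a minimal illustration, not the bench4gis implementation; the `match_rate` function, the 100 m tolerance, and the sample coordinates are all assumptions for the example:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def match_rate(candidate, reference, tolerance_m=100.0):
    """Fraction of reference addresses whose candidate geocode falls
    within tolerance_m of the open-data reference coordinate."""
    hits = sum(
        1 for addr, (lat, lon) in reference.items()
        if addr in candidate
        and haversine_m(lat, lon, *candidate[addr]) <= tolerance_m
    )
    return hits / len(reference)

reference = {"100 Main St": (38.0406, -84.5037)}  # open-data surrogate
candidate = {"100 Main St": (38.0407, -84.5037)}  # in-house geocoder output
rate = match_rate(candidate, reference)  # the two points are ~11 m apart
```

Because the surrogate addresses are public, this comparison can run entirely behind the organization's firewall, which is the quality-assurance gap the abstract identifies.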

Proceedings : ... IEEE International Conference on Big Data, pages 4067-4070.
Citations: 5
Using hospital administrative data to infer patient-patient contact via the consistent co-presence algorithm.
Pub Date : 2019-12-01 Epub Date: 2020-02-24 DOI: 10.1109/bigdata47090.2019.9006148
Jeffrey Lienert, Felix Reed-Tsochas, Laura Koehly, Christopher Steven Marcum

In health care settings, patients who are physically proximate to other patients (co-presence) for a meaningful amount of time may have differential health outcomes depending on who they are in contact with. How best to measure this co-presence, however, is an open question, and previous approaches have limitations that may make them inappropriate for complex health care settings. Here, we introduce a novel method, which we term "consistent co-presence", that implicitly models the many complexities of patient scheduling and movement through a hospital by randomly perturbing patients' entry times into the health care system. This algorithm generates networks that can be employed in models of patient outcomes, such as 1-year mortality, and that are preferred over previously established alternative algorithms from a model-comparison perspective. These results indicate that consistent co-presence retains meaningful information about patient-patient interaction, which may affect outcomes relevant to health care practice. Furthermore, the generalizability of this approach allows it to be applied to a wide variety of complex systems.
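The perturbation idea can be sketched as follows: jitter each patient's entry time and ask how often two stays still overlap. This is an illustrative toy, not the paper's algorithm; the jitter width, draw count, and the `consistent_co_presence` function name are assumptions made for the example:

```python
import random

def overlaps(a, b):
    """True if two (start, end) stays intersect."""
    return a[0] < b[1] and b[0] < a[1]

def consistent_co_presence(stay_a, stay_b, jitter=2.0, draws=200, seed=0):
    """Fraction of random entry-time perturbations under which two
    patients' stays still overlap.

    Stays are (entry, exit) in hours; each draw shifts each entry
    uniformly by +/- jitter while preserving the stay's length.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        da = rng.uniform(-jitter, jitter)
        db = rng.uniform(-jitter, jitter)
        a = (stay_a[0] + da, stay_a[1] + da)
        b = (stay_b[0] + db, stay_b[1] + db)
        hits += overlaps(a, b)
    return hits / draws

# Long, clearly overlapping stays survive every perturbation; a
# barely-touching pair survives only some of them.
robust = consistent_co_presence((0, 24), (10, 30))
fragile = consistent_co_presence((0, 5), (5, 10))
assert robust > fragile
```

Thresholding these fractions yields an edge list — patients connected only when their co-presence is consistent under scheduling uncertainty — which is the kind of network the abstract feeds into outcome models such as 1-year mortality.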

Proceedings : ... IEEE International Conference on Big Data, volume 2019, pages 2756-2762.
Citations: 0