
Biodata Mining: Latest Articles

6mA-StackingCV: an improved stacking ensemble model for predicting DNA N6-methyladenine site.
IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2023-11-27, DOI: 10.1186/s13040-023-00348-8
Guohua Huang, Xiaohong Huang, Wei Luo

DNA N6-adenine methylation (N6-methyladenine, 6mA) plays a key regulatory role in cellular processes. Precisely recognizing 6mA sites is important for further exploring its biological functions. Although many computational methods for 6mA site prediction have been developed over the past decades, there is still much room for improvement. We present a cross-validation-based stacking ensemble model for 6mA site prediction, called 6mA-StackingCV. 6mA-StackingCV is a type of meta-learning algorithm that uses the output of cross-validation as input to the final classifier. 6mA-StackingCV reached state-of-the-art performance in the Rosaceae independent test. Extensive tests demonstrated the stability and flexibility of 6mA-StackingCV. We implemented 6mA-StackingCV as a user-friendly web application, which allows one to selectively choose representations or learning algorithms. This application is freely available at http://www.biolscience.cn/6mA-stackingCV/ . The source code and experimental data are available at https://github.com/Xiaohong-source/6mA-stackingCV .
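
The idea of feeding cross-validation output into a final classifier can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the toy base learners (`MeanThreshold`), the toy meta-learner (`TinyLogistic`), the synthetic two-feature data, and the choice of five folds are all assumptions made for illustration. The point it shows is that every sample's meta-feature comes from base learners that never saw that sample during fitting.

```python
import math
import random


class MeanThreshold:
    """Hypothetical toy base learner: scores one feature against the midpoint of the class means."""

    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        pos = [x[self.feature] for x, t in zip(X, y) if t == 1]
        neg = [x[self.feature] for x, t in zip(X, y) if t == 0]
        mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
        self.mid = (mu_pos + mu_neg) / 2
        self.sign = 1.0 if mu_pos >= mu_neg else -1.0
        return self

    def predict_proba(self, X):
        return [1 / (1 + math.exp(-self.sign * (x[self.feature] - self.mid))) for x in X]


class TinyLogistic:
    """Hypothetical toy meta-learner: logistic regression fitted by plain gradient descent."""

    def fit(self, Z, y, lr=0.5, epochs=2000):
        self.w, self.b = [0.0] * len(Z[0]), 0.0
        for _ in range(epochs):
            gw, gb = [0.0] * len(self.w), 0.0
            for z, t in zip(Z, y):
                err = self._p(z) - t
                gw = [g + err * zi for g, zi in zip(gw, z)]
                gb += err
            self.w = [w - lr * g / len(Z) for w, g in zip(self.w, gw)]
            self.b -= lr * gb / len(Z)
        return self

    def _p(self, z):
        return 1 / (1 + math.exp(-(sum(w * zi for w, zi in zip(self.w, z)) + self.b)))

    def predict(self, Z):
        return [int(self._p(z) >= 0.5) for z in Z]


def out_of_fold_features(make_learners, X, y, k=5, seed=0):
    """Return one row of base-learner scores per sample, each produced by
    learners fitted on the other k-1 folds only."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    Z = [None] * len(X)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        learners = [m.fit([X[i] for i in train], [y[i] for i in train]) for m in make_learners()]
        cols = [m.predict_proba([X[i] for i in held_out]) for m in learners]
        for row, i in enumerate(held_out):
            Z[i] = [c[row] for c in cols]
    return Z


# Synthetic data: 20 positives centred at +1 and 20 negatives centred at -1.
rng = random.Random(1)
X = [[rng.gauss(m, 1.0), rng.gauss(m, 1.0)] for m in (1.0, -1.0) for _ in range(20)]
y = [1] * 20 + [0] * 20


def make_learners():
    return [MeanThreshold(0), MeanThreshold(1)]


Z = out_of_fold_features(make_learners, X, y, k=5)
meta = TinyLogistic().fit(Z, y)  # final classifier trained on cross-validation output
accuracy = sum(p == t for p, t in zip(meta.predict(Z), y)) / len(y)
print(f"meta-classifier accuracy on out-of-fold features: {accuracy:.2f}")
```

To classify new sequences in such a scheme, the base learners would typically be refitted on all training data and their scores passed to the fitted meta-learner.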

Biodata Mining 16(1):34. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10680251/pdf/
Citations: 0
Endoscopy-based IBD identification by a quantized deep learning pipeline.
IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2023-11-25, DOI: 10.1186/s13040-023-00350-0
Massimiliano Datres, Elisa Paolazzi, Marco Chierici, Matteo Pozzi, Antonio Colangelo, Marcello Dorian Donzella, Giuseppe Jurman

Background: Discrimination between patients affected by inflammatory bowel diseases and healthy controls on the basis of endoscopic imaging is a challenging problem for machine learning models. This task is used here as the testbed for a novel deep learning classification pipeline, powered by a set of solutions that enhance characterising elements such as reproducibility, interpretability, reduced computational workload, bias-free modeling and careful image preprocessing.

Results: First, an automatic preprocessing procedure is devised to remove artifacts from clinical data; the resulting images are then fed to an aggregated per-patient model that mimics the clinicians' decision process. The predictions are based on multiple snapshots obtained through resampling, reducing the risk of misleading outcomes by removing the low-confidence predictions. Each patient's outcome is explained by returning the images the prediction is based upon, supporting clinicians in verifying diagnoses without the need to evaluate the full set of endoscopic images. As a major theoretical contribution, quantization is employed to reduce the complexity and the computational cost of the model, allowing its deployment on low-power devices with an almost negligible 3% performance degradation. This quantization procedure is relevant not only in the context of per-patient models but also for assessing the feasibility of providing real-time support to clinicians even in low-resource environments. The pipeline is demonstrated on a private dataset of endoscopic images of 758 IBD patients and 601 healthy controls, achieving a Matthews Correlation Coefficient of 0.9 as top performance on the test set.
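
For readers unfamiliar with the metric reported above, the Matthews Correlation Coefficient for a binary task can be computed from the four confusion-matrix counts. The sketch below uses the standard formula; the label vectors are invented toy data, not values from the study.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels (1 = IBD, 0 = healthy control); purely illustrative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)
print(round(mcc, 3))
```

Unlike accuracy, the coefficient stays informative when the two classes are imbalanced, which is one common reason for reporting it on clinical datasets.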

Conclusion: We highlighted how a comprehensive pre-processing pipeline plays a crucial role in identifying and removing artifacts from data, solving one of the principal challenges encountered when working with clinical data. Furthermore, we constructively showed how it is possible to emulate the clinicians' decision process and how doing so offers significant advantages, particularly in terms of explainability and trust within the healthcare context. Last but not least, we showed that quantization can be a useful tool to reduce time and resource consumption with an acceptable degradation of model performance. The quantization study proposed in this work points to the potential development of real-time quantized algorithms as valuable tools to support clinicians during endoscopy procedures.
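
The abstract does not state which quantization scheme the pipeline uses. As a rough illustration of why quantization reduces cost at a small accuracy loss, the following sketch applies generic affine 8-bit quantization to a handful of made-up floating-point weights; the function names and the weights are assumptions for illustration only.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers with an affine (scale, zero-point) scheme."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # Include 0.0 in the range so that zero is represented exactly.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the integer codes."""
    return [(qi - zero_point) * scale for qi in q]


weights = [-1.0, -0.25, 0.0, 0.5, 2.0]  # made-up example weights
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, f"max rounding error = {max_error:.4f}")
```

Each weight is stored in one byte instead of four (for 32-bit floats), and the rounding error per weight is bounded by half the scale, which is the trade-off between resource use and performance that the conclusion refers to.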

Biodata Mining 16(1):33. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10675910/pdf/
Citations: 0
DeepAutoGlioma: a deep learning autoencoder-based multi-omics data integration and classification tools for glioma subtyping.
IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2023-11-15, DOI: 10.1186/s13040-023-00349-7
Sana Munquad, Asim Bikas Das

Background and objective: The classification of glioma subtypes is essential for precision therapy. Due to the heterogeneity of gliomas, the subtype-specific molecular pattern can be captured by integrating and analyzing high-throughput omics data from different genomic layers. The development of a deep-learning framework enables the integration of multi-omics data to classify the glioma subtypes to support the clinical diagnosis.

Results: Transcriptome and methylome data of glioma patients were preprocessed, and differentially expressed features from both datasets were identified. Subsequently, a Cox regression analysis determined genes and CpGs associated with survival. Gene set enrichment analysis was carried out to examine the biological significance of the features. Further, we identified CpG and gene pairs by mapping them in the promoter region of corresponding genes. The methylation and gene expression levels of these CpGs and genes were embedded in a lower-dimensional space with an autoencoder. Next, ANN and CNN were used to classify subtypes using the latent features from the embedding space. CNN performed better than ANN for subtyping lower-grade gliomas (LGG) and glioblastoma multiforme (GBM). The subtyping accuracy of CNN was 98.03% (± 0.06) and 94.07% (± 0.01) in LGG and GBM, respectively. The precision of the models was 97.67% in LGG and 90.40% in GBM. The model sensitivity was 96.96% in LGG and 91.18% in GBM. Additionally, we observed the superior performance of CNN with external datasets. The gene-CpG pairs used to develop the model performed better than random CpG-gene pairs, preprocessed data, and single-omics data.
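
The step of pairing CpGs with genes through promoter regions can be pictured with a small sketch. Everything in it is an assumption for illustration: the abstract does not define the promoter window, so a window of 1500 bp upstream to 500 bp downstream of the transcription start site is used here, and the gene names, CpG identifiers, and coordinates are invented.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    tss: int     # transcription start site (hypothetical coordinate)
    strand: str  # "+" or "-"


def promoter_window(gene, upstream=1500, downstream=500):
    """Return (start, end) of the assumed promoter window, respecting strand."""
    if gene.strand == "+":
        return gene.tss - upstream, gene.tss + downstream
    return gene.tss - downstream, gene.tss + upstream


def pair_cpgs_with_genes(cpgs, genes, upstream=1500, downstream=500):
    """List (cpg_id, gene_name) for every CpG that falls inside a gene's promoter window."""
    pairs = []
    for cpg_id, (chrom, pos) in cpgs.items():
        for gene in genes:
            start, end = promoter_window(gene, upstream, downstream)
            if gene.chrom == chrom and start <= pos <= end:
                pairs.append((cpg_id, gene.name))
    return pairs


# Invented toy annotation, not real genomic coordinates.
genes = [Gene("GENE_A", "chr1", 10_000, "+"), Gene("GENE_B", "chr1", 50_000, "-")]
cpgs = {
    "cg_toy_1": ("chr1", 9_200),
    "cg_toy_2": ("chr1", 50_900),
    "cg_toy_3": ("chr1", 30_000),
    "cg_toy_4": ("chr2", 9_200),
}
pairs = pair_cpgs_with_genes(cpgs, genes)
print(pairs)
```

In a pipeline of the kind described, the methylation level of each paired CpG and the expression level of its gene would then form the input features for the autoencoder.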

Conclusions: The current study showed that a novel feature selection and data integration strategy led to the development of DeepAutoGlioma, an effective framework for diagnosing glioma subtypes.

Biodata Mining 16(1):32. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10652591/pdf/
Citations: 0
Research agenda for using artificial intelligence in health governance: interpretive scoping review and framework.
IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2023-10-31, DOI: 10.1186/s13040-023-00346-w
Maryam Ramezani, Amirhossein Takian, Ahad Bakhtiari, Hamid R Rabiee, Sadegh Ghazanfari, Saharnaz Sazgarnejad

Background: The governance of health systems is complex in nature due to several intertwined and multi-dimensional factors contributing to it. Recent challenges of health systems reflect the need for innovative approaches that can minimize adverse consequences of policies. Hence, there is compelling evidence of a distinct outlook on the health ecosystem using artificial intelligence (AI). Therefore, this study aimed to investigate the roles of AI and its applications in health system governance through an interpretive scoping review of current evidence.

Method: This study intended to offer a research agenda and framework for the applications of AI in health systems governance. To capture evidence focused on the application of AI in health governance from different perspectives, we searched the literature published from 2000 to 2023 in the PubMed, Scopus, and Web of Science databases.

Results: Our findings showed that integrating AI capabilities into health systems governance has the potential to influence three cardinal dimensions of health. These include social determinants of health, elements of governance, and health system tasks and goals. AI paves the way for strengthening the health system's governance through various aspects, i.e., intelligence innovations, flexible boundaries, multidimensional analysis, new insights, and cognition modifications to the health ecosystem area.

Conclusion: AI is expected to be seen as a tool with new applications and capabilities, with the potential to change each component of governance in the health ecosystem, which can eventually help achieve health-related goals.

Biodata Mining 16(1):31. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10617108/pdf/
Citations: 0
Correction: A prognostic model based on seven immune-related genes predicts the overall survival of patients with hepatocellular carcinoma.
IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2023-10-26, DOI: 10.1186/s13040-023-00347-9
Qian Yan, Wenjiang Zheng, Boqing Wang, Baoqian Ye, Huiyan Luo, Xinqian Yang, Ping Zhang, Xiongwen Wang
Biodata Mining 16(1):30. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10605871/pdf/
Citations: 0
Prescription pattern analysis of Type 2 Diabetes Mellitus: a cross-sectional study in Isfahan, Iran.
IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2023-10-20, DOI: 10.1186/s13040-023-00344-y
Elnaz Ziad, Somayeh Sadat, Farshad Farzadfar, Mohammad-Reza Malekpour

Background: Patients with Type 2 Diabetes Mellitus (T2DM) are at a higher risk of polypharmacy and more susceptible to irrational prescriptions; therefore, it is important to monitor pharmacological therapy patterns. The primary objective of this study was to highlight current prescription patterns in T2DM patients and compare them with existing Standards of Medical Care in Diabetes. The second objective was to analyze whether age and gender affect prescription patterns.

Method: This cross-sectional study was conducted using the Iran Health Insurance Organization (IHIO) prescription database. The database was mined with an Association Rule Mining (ARM) technique, FP-Growth, in order to find drugs co-prescribed with anti-diabetic medications. The algorithm was applied at different levels of the Anatomical Therapeutic Chemical (ATC) classification system, which assigns codes to drugs based on their anatomical, pharmacological, therapeutic, and chemical properties, to provide an in-depth analysis of co-prescription patterns.
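
Mining at different ATC levels amounts to truncating each drug code to the prefix that identifies the chosen level. The sketch below relies on the standard ATC code layout (1, 3, 4, 5, and 7 characters for levels 1 to 5); the helper name `atc_at_level` and the example prescription are assumptions for illustration, not part of the study's code.

```python
# Number of leading characters that identify each ATC level.
ATC_LEVEL_LENGTH = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def atc_at_level(code, level):
    """Truncate a full ATC code to the prefix for the requested level (1-5)."""
    length = ATC_LEVEL_LENGTH[level]
    if len(code) < length:
        raise ValueError(f"{code!r} is too short for ATC level {level}")
    return code[:length]


# Example prescription: metformin (A10BA02) and atorvastatin (C10AA05).
prescription = ["A10BA02", "C10AA05"]
for level in (1, 2, 5):
    print(level, sorted({atc_at_level(code, level) for code in prescription}))
```

At level 2 the example prescription becomes the item set {A10, C10}, which matches the granularity of the groups named in the results (for instance, C10 for Lipid Modifying Agents).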

Results: Altogether, the prescriptions of 914,652 patients were analyzed, of whom 91,505 were found to have diabetes. According to our results, prescribing Lipid Modifying Agents (C10) (56.3%), Agents Acting on The Renin-Angiotensin System (C09) (48.9%), Antithrombotic Agents (B01) (35.7%), and Beta Blocking Agents (C07) (30.1%) were meaningfully associated with the prescription of Drugs Used in Diabetes. Our study also revealed that female diabetic patients have a higher lift for taking Thyroid Preparations, and the older the patients were, the more they were prone to take neuropathy-related medications. Additionally, the results suggest that there are gender differences in the association between aspirin and diabetes drugs, with the differences becoming less pronounced in old age.
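
The results above are phrased in terms of association-rule measures such as lift. As a reminder of what these measure, the sketch below computes support, confidence, and lift by direct counting over a few invented prescriptions coded at ATC level 2. It does not implement FP-Growth, which finds frequent item sets far more efficiently, and the numbers it prints have no relation to the study's data.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(transactions, set(antecedent) | set(consequent)) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by the baseline support of the consequent; >1 means positive association."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Invented prescriptions as sets of ATC level-2 codes (A10 = Drugs Used in Diabetes).
transactions = [
    {"A10", "C10", "C09"},
    {"A10", "C10"},
    {"A10", "B01"},
    {"C10", "C07"},
    {"C09"},
    {"A10", "C10", "B01"},
    {"N02"},
    {"C10"},
]
rule_confidence = confidence(transactions, {"A10"}, {"C10"})
rule_lift = lift(transactions, {"A10"}, {"C10"})
print(f"A10 -> C10: confidence = {rule_confidence:.2f}, lift = {rule_lift:.2f}")
```

A lift above 1 indicates that the consequent is prescribed more often alongside the antecedent than its overall frequency would suggest.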

Conclusions: Almost all of the association rules found in this research were clinically meaningful, proving the potential of ARM for co-prescription pattern discovery. Moreover, implementing level-based ARM was effective in detecting difficult-to-spot rules. Additionally, the majority of drugs prescribed by physicians were consistent with the Standards of Medical Care in Diabetes.

Biodata Mining 16(1):29. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10588025/pdf/
Citations: 0
Attention-based dual-path feature fusion network for automatic skin lesion segmentation.
IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2023-10-09, DOI: 10.1186/s13040-023-00345-x
Zhenxiang He, Xiaoxia Li, Yuling Chen, Nianzu Lv, Yong Cai

Automatic segmentation of skin lesions is a critical step in Computer Aided Diagnosis (CAD) of melanoma. However, blurred lesion boundaries, uneven color distribution, and low image contrast lead to poor segmentation results. To address the difficulty of segmenting skin lesions, this paper proposes an Attention-based Dual-path Feature Fusion Network (ADFFNet) for automatic skin lesion segmentation. Firstly, in the spatial path, a Boundary Refinement (BR) module is designed for the output of low-level features to filter out irrelevant background information and retain more boundary details of the lesion area. Secondly, in the context path, a Multi-scale Feature Selection (MFS) module is constructed for high-level feature output to capture multi-scale context information and use the attention mechanism to filter out redundant semantic information. Finally, we design a Dual-path Feature Fusion (DFF) module, which uses high-level global attention information to guide the step-by-step fusion of high-level semantic features and low-level detail features; this helps restore image detail and further improves the pixel-level segmentation accuracy of skin lesions. In the experiments, the ISIC 2018 and PH2 datasets are employed to evaluate the effectiveness of the proposed method. It achieves a performance of 0.890/0.925 and 0.933/0.954 on the F1-score and SE index, respectively. Comparative analysis with state-of-the-art segmentation methods reveals that the ADFFNet algorithm exhibits superior segmentation performance.
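
The two reported indices can be computed pixel by pixel from a predicted mask and a ground-truth mask. The sketch below assumes that SE denotes sensitivity (recall), as is common in segmentation papers, and uses two invented 3x3 binary masks; it is not taken from the paper's evaluation code.

```python
def segmentation_scores(pred, truth):
    """Return (F1, sensitivity) for two binary masks given as nested lists of 0/1."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            tp += p == 1 and t == 1
            fp += p == 1 and t == 0
            fn += p == 0 and t == 1
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 1.0
    return f1, sensitivity


# Invented masks: 1 marks lesion pixels, 0 marks background.
pred_mask = [[0, 1, 1],
             [0, 1, 0],
             [0, 0, 0]]
true_mask = [[0, 1, 1],
             [0, 1, 1],
             [0, 0, 0]]
f1, se = segmentation_scores(pred_mask, true_mask)
print(f"F1 = {f1:.3f}, SE = {se:.3f}")
```

For binary masks the F1-score is identical to the Dice coefficient, so both names may appear for the same quantity in the segmentation literature.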

{"title":"Attention-based dual-path feature fusion network for automatic skin lesion segmentation.","authors":"Zhenxiang He, Xiaoxia Li, Yuling Chen, Nianzu Lv, Yong Cai","doi":"10.1186/s13040-023-00345-x","DOIUrl":"10.1186/s13040-023-00345-x","url":null,"abstract":"<p><p>Automatic segmentation of skin lesions is a critical step in Computer Aided Diagnosis (CAD) of melanoma. However, due to the blurring of the lesion boundary, uneven color distribution, and low image contrast, resulting in poor segmentation result. Aiming at the problem of difficult segmentation of skin lesions, this paper proposes an Attention-based Dual-path Feature Fusion Network (ADFFNet) for automatic skin lesion segmentation. Firstly, in the spatial path, a Boundary Refinement (BR) module is designed for the output of low-level features to filter out irrelevant background information and retain more boundary details of the lesion area. Secondly, in the context path, a Multi-scale Feature Selection (MFS) module is constructed for high-level feature output to capture multi-scale context information and use the attention mechanism to filter out redundant semantic information. Finally, we design a Dual-path Feature Fusion (DFF) module, which uses high-level global attention information to guide the step-by-step fusion of high-level semantic features and low-level detail features, which is beneficial to restore image detail information and further improve the pixel-level segmentation accuracy of skin lesion. In the experiment, the ISIC 2018 and PH2 datasets are employed to evaluate the effectiveness of the proposed method. It achieves a performance of 0.890/ 0.925 and 0.933 /0.954 on the F1-score and SE index, respectively. 
Comparative analysis with state-of-the-art segmentation methods reveals that the ADFFNet algorithm exhibits superior segmentation performance.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"28"},"PeriodicalIF":4.5,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10561442/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41155445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Quantum analysis of squiggle data.
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-10-06 DOI: 10.1186/s13040-023-00343-z
Naya Nagy, Matthew Stuart-Edwards, Marius Nagy, Liam Mitchell, Athanasios Zovoilis

Squiggle data is the numerical output of DNA and RNA sequencing on the Nanopore next-generation sequencing platform. Nanopore sequencing offers expanded applications compared to previous sequencing techniques but produces a large amount of data in the form of current measurements over time. The analysis of these segments of current measurements requires more complex and computationally intensive algorithms than previous sequencing technologies. The purpose of this study is to investigate, in principle, the potential of using quantum computers to speed up Nanopore data analysis. Quantum circuits are designed to extract major features of squiggle current measurements. The circuits are analyzed theoretically in terms of size and performance. Practical experiments on IBM QX show the limitations of state-of-the-art quantum computers in tackling real-life squiggle data problems. Nevertheless, pre-processing the squiggle data with the inverse wavelet transform, which is also tested and analyzed in this paper, reduces the dimensionality of the problem so that it could fit a reasonably sized quantum computer in the hopefully near future.
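How a wavelet-based step can shrink a current trace may be easier to see in a small sketch. The abstract does not say which wavelet or procedure the authors use, so everything here is an assumption: the sketch applies a forward Haar decomposition (not the inverse transform named in the abstract) and keeps only the approximation coefficients, halving the signal length at each level. The function names and the toy signal are hypothetical.

```python
import math
from typing import List


def haar_approximation(signal: List[float]) -> List[float]:
    """One Haar decomposition level, keeping only the approximation coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    # Each output value summarises a pair of neighbouring samples.
    return [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]


def reduce_signal(signal: List[float], levels: int) -> List[float]:
    """Apply `levels` Haar approximation steps; the length shrinks by 2**levels."""
    out = list(signal)
    for _ in range(levels):
        out = haar_approximation(out)
    return out


# Toy current trace with two plateaus (arbitrary units, invented for illustration).
squiggle = [80.0, 82.0, 81.0, 79.0, 120.0, 122.0, 121.0, 119.0]
reduced = reduce_signal(squiggle, levels=2)

print(len(squiggle), "->", len(reduced), reduced)
```

Two levels reduce the eight samples to two coefficients, one per plateau, which is the kind of size reduction that would matter when the number of available qubits is small.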

Citations: 0
Disclosing transcriptomics network-based signatures of glioma heterogeneity using sparse methods.
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-09-26 DOI: 10.1186/s13040-023-00341-1
Sofia Martins, Roberta Coletti, Marta B Lopes

Gliomas are primary malignant brain tumors with poor survival and high resistance to available treatments. Improving the molecular understanding of glioma and disclosing novel biomarkers of tumor development and progression could help to find novel targeted therapies for this type of cancer. Public databases such as The Cancer Genome Atlas (TCGA) provide an invaluable source of molecular information on cancer tissues. Machine learning tools show promise in dealing with the high dimensionality of omics data and extracting relevant information from it. In this work, network inference and clustering methods, namely Joint Graphical Lasso and Robust Sparse K-means Clustering, were applied to RNA-sequencing data from TCGA glioma patients to identify shared and distinct gene networks among different types of glioma (glioblastoma, astrocytoma, and oligodendroglioma) and to disclose new patient groups and the relevant genes behind the separation of the groups. The results suggest that astrocytoma and oligodendroglioma share more similarities with each other than with glioblastoma, highlighting the molecular differences between glioblastoma and the other glioma subtypes. After a comprehensive literature search on the relevant genes pointed out by our analysis, we identified potential candidates for biomarkers of glioma. Further molecular validation of these genes is encouraged to understand their potential role in diagnosis and in the design of novel therapies.

Citations: 0
STAR_outliers: a python package that separates univariate outliers from non-normal distributions.
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-09-04 DOI: 10.1186/s13040-023-00342-0
John T Gregg, Jason H Moore

There are not currently any univariate outlier detection algorithms that transform and model arbitrarily shaped distributions to remove univariate outliers. Some algorithms model skew, even fewer model kurtosis, and none of them model bimodality and monotonicity. To overcome these challenges, we have implemented an algorithm for Skew and Tail-heaviness Adjusted Removal of Outliers (STAR_outliers) that robustly removes univariate outliers from distributions with many different shape profiles, including extreme skew, extreme kurtosis, bimodality, and monotonicity. We show that STAR_outliers removes simulated outliers with greater recall and precision than several general algorithms, and it also models the outlier bounds of real data distributions with greater accuracy. Background: Reliably removing univariate outliers from arbitrarily shaped distributions is a difficult task. Incorrectly assuming unimodality or overestimating tail heaviness fails to remove outliers, while underestimating tail heaviness incorrectly removes regular data from the tails. Skew often produces one heavy tail and one light tail, and we show that several sophisticated outlier removal algorithms often fail to remove outliers from the light tail. Multivariate outlier detection algorithms have recently become popular, but having tested PyOD's multivariate outlier removal algorithms, we found them to be inadequate for univariate outlier removal. They usually do not allow for univariate input, and they do not fit their distributions of outliership scores with a model on which an outlier threshold can be accurately established. Thus, there is a need for a flexible outlier removal algorithm that can model arbitrarily shaped univariate distributions. Results: In order to effectively model arbitrarily shaped univariate distributions, we have combined several well-established algorithms into a new algorithm called STAR_outliers. STAR_outliers removes more simulated true outliers and fewer non-outliers than several other univariate algorithms. These include several normality-assuming outlier removal methods, PyOD's isolation forest (IF) outlier removal algorithm (ACM Transactions on Knowledge Discovery from Data (TKDD) 6:3, 2012) with default settings, and an IQR-based algorithm by Verardi and Vermandele that removes outliers while accounting for skew and kurtosis (Verardi and Vermandele, Journal de la Société Française de Statistique 157:90-114, 2016). Since the IF algorithm's default model fit the outliership scores poorly, we also compared the isolation forest algorithm with a model that removes as many datapoints as STAR_outliers does, in order of decreasing outliership score. We also compared these algorithms on the publicly available 2018 National Health and Nutrition Examination Survey (NHANES) data by setting the outlier threshold to keep values falling within the main 99.3 percent of the fitted model's domain. We show that, on average, our STAR_outliers algorithm removes significantly closer to 0.7 percent of values from these features than other outlier removal methods. STAR_outliers is an easily implemented python package for outlier removal that outperforms multiple commonly used univariate outlier removal methods.
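For orientation, the sketch below shows the plain 1.5×IQR rule (Tukey's fences), the kind of baseline that assumes a roughly symmetric, light-tailed distribution. It is not the STAR_outliers algorithm and makes no skew or tail-heaviness adjustment. One plausible reading of the 99.3 percent threshold, which is an assumption and not stated in the abstract, is that it matches the share of a normal distribution lying inside these fences; the code checks that number with the standard library. The function names and the toy data are hypothetical.

```python
from statistics import NormalDist, quantiles
from typing import List, Tuple


def tukey_fences(values: List[float], k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Keep only the values that lie inside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if lower <= v <= upper]


# Coverage of the 1.5*IQR fences under an exactly standard normal model.
z = NormalDist()
q3_normal = z.inv_cdf(0.75)                    # upper quartile of N(0, 1)
upper_fence = q3_normal + 1.5 * (2 * q3_normal)  # IQR of N(0, 1) is 2 * Q3
coverage = z.cdf(upper_fence) - z.cdf(-upper_fence)

# Toy measurements with one obvious outlier (values invented for illustration).
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0]
cleaned = remove_outliers(data)

print(f"normal coverage inside fences: {coverage:.4f}")
print("removed:", [v for v in data if v not in cleaned])
```

On skewed or heavy-tailed data the same fences would cut too much from one tail and too little from the other, which is the failure mode the abstract describes for normality-assuming methods.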
Citations: 0