
JMIR Bioinformatics and Biotechnology: Latest Articles

Assessing Privacy Vulnerabilities in Genetic Data Sets: Scoping Review.
Pub Date : 2024-05-27 DOI: 10.2196/54332
Mara Thomas, Nuria Mackes, Asad Preuss-Dodhy, Thomas Wieland, Markus Bundschus

Background: Genetic data are widely considered inherently identifiable. However, genetic data sets come in many shapes and sizes, and the feasibility of privacy attacks depends on their specific content. Assessing the reidentification risk of genetic data is complex, yet there is a lack of guidelines or recommendations that support data processors in performing such an evaluation.

Objective: This study aims to gain a comprehensive understanding of the privacy vulnerabilities of genetic data and create a summary that can guide data processors in assessing the privacy risk of genetic data sets.

Methods: We conducted a 2-step search, in which we first identified 21 reviews published between 2017 and 2023 on the topic of genomic privacy and then analyzed all references cited in the reviews (n=1645) to identify 42 unique original research studies that demonstrate a privacy attack on genetic data. We then evaluated the type and components of genetic data exploited for these attacks as well as the effort and resources needed for their implementation and their probability of success.

Results: From our literature review, we derived 9 nonmutually exclusive features of genetic data that are both inherent to any genetic data set and informative about privacy risk: biological modality, experimental assay, data format or level of processing, germline versus somatic variation content, content of single nucleotide polymorphisms, short tandem repeats, aggregated sample measures, structural variants, and rare single nucleotide variants.

Conclusions: On the basis of our literature review, the evaluation of these 9 features covers the great majority of privacy-critical aspects of genetic data and thus provides a foundation and guidance for assessing genetic data risk.

Citations: 0
The Roles of NOTCH3 p.R544C and Thrombophilia Genes in Vietnamese Patients With Ischemic Stroke: Study Involving a Hierarchical Cluster Analysis
Pub Date : 2024-05-07 DOI: 10.2196/56884
Huong Thi Thu Bui, Quỳnh Nguyễn Thị Phương, Ho Cam Tu, Sinh Nguyen Phuong, Thuy Thi Pham, Thu Vu, Huyen Nguyen Thi Thu, Lam Khanh Ho, Dung Nguyen Tien

Background: The etiology of ischemic stroke is multifactorial. Several gene mutations have been identified as leading causes of cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy (CADASIL), a hereditary disease that causes stroke and other neurological symptoms.

Objective: We aimed to identify the variants of NOTCH3 and thrombophilia genes, and their complex interactions with other factors.

Methods: We conducted a hierarchical cluster analysis (HCA) on the data of 100 patients diagnosed with ischemic stroke. The variants of NOTCH3 and thrombophilia genes were identified by polymerase chain reaction with confronting 2-pair primers and real-time polymerase chain reaction. The overall preclinical characteristics, cumulative cutpoint values, and factors associated with these somatic mutations were analyzed in unidimensional and multidimensional scaling models.

Results: We identified the following optimal cutpoints: creatinine, 83.67 (SD 9.19) µmol/L; age, 54 (SD 5) years; prothrombin time (PT), 13.25 (SD 0.17) seconds; and international normalized ratio (INR), 1.02 (SD 0.03). Using the Nagelkerke method, cutpoint 50% values of the Glasgow Coma Scale score; modified Rankin scale score; and National Institutes of Health Stroke Scale scores at admission, after 24 hours, and at discharge were 12.77, 2.86 (SD 1.21), 9.83 (SD 2.85), 7.29 (SD 2.04), and 6.85 (SD 2.90), respectively.

Conclusions: The variants of MTHFR (C677T and A1298C) and NOTCH3 p.R544C may influence the stroke severity under specific conditions of PT, creatinine, INR, and BMI, with risk ratios of 4.8 (95% CI 1.53-15.04) and 3.13 (95% CI 1.60-6.11), respectively (Fisher exact test, P<.05). It is interesting that although there are many genes linked to increased atrial fibrillation risk, not all of them are associated with ischemic stroke risk. With the detection of stroke risk loci, more information can be gained on their impacts and interconnections, especially in young patients.
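
For readers unfamiliar with how a risk ratio and its 95% CI are obtained, the following is a minimal sketch of the usual 2x2-table calculation with a log-scale Wald interval. It is an assumption that the study used this interval; the function name `risk_ratio` and all counts are hypothetical and are not taken from the paper.

```python
import math


def risk_ratio(a, b, c, d, z=1.96):
    """Return (risk ratio, lower, upper) for a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    The interval is the Wald interval computed on the log scale.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR) for independent binomial samples
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts, for illustration only (not study data)
rr, lower, upper = risk_ratio(a=15, b=5, c=20, d=60)
print(f"RR = {rr:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

An interval that excludes 1, as both reported intervals do, indicates an association at the 5% level under these assumptions.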
Citations: 0
ChatGPT and Medicine: Together We Embrace the AI Renaissance
Pub Date : 2024-05-07 DOI: 10.2196/52700
Sean Hacking
The generative artificial intelligence (AI) model ChatGPT holds transformative prospects in medicine. The development of such models has signaled the beginning of a new era where complex biological data can be made more accessible and interpretable. ChatGPT is a natural language processing tool that can process, interpret, and summarize vast data sets. It can serve as a digital assistant for physicians and researchers, aiding in integrating medical imaging data with other multiomics data and facilitating the understanding of complex biological systems. The physician’s and AI’s viewpoints emphasize the value of such AI models in medicine, providing tangible examples of how this could enhance patient care. The editorial also discusses the rise of generative AI, highlighting its substantial impact in democratizing AI applications for modern medicine. While AI may not supersede health care professionals, practitioners incorporating AI into their practices could potentially have a competitive edge.
Citations: 0
User and Usability Testing of a Web-Based Genetics Education Tool for Parkinson Disease: Mixed Methods Study.
Pub Date : 2023-08-30 DOI: 10.2196/45370
Noah Han, Rachel A Paul, Tanya Bardakjian, Daniel Kargilis, Angela R Bradbury, Alice Chen-Plotkin, Thomas F Tropea

Background: Genetic testing is essential to identify research participants for clinical trials enrolling people with Parkinson disease (PD) carrying a variant in the glucocerebrosidase (GBA) or leucine-rich repeat kinase 2 (LRRK2) genes. The limited availability of professionals trained in neurogenetics or genetic counseling is a major barrier to increased testing. Telehealth solutions to increase access to genetics education can help address issues around counselor availability and offer options to patients and family members.

Objective: As an alternative to pretest genetic counseling, we developed a web-based genetics education tool focused on GBA and LRRK2 testing for PD called the Interactive Multimedia Approach to Genetic Counseling to Inform and Educate in Parkinson's Disease (IMAGINE-PD) and conducted user testing and usability testing. The objective was to conduct user and usability testing to obtain stakeholder feedback to improve IMAGINE-PD.

Methods: Genetic counselors and PD and neurogenetics subject matter experts developed content for IMAGINE-PD specifically focused on GBA and LRRK2 genetic testing. Structured interviews were conducted with 11 movement disorder specialists and 13 patients with PD to evaluate the content of IMAGINE-PD in user testing and with 12 patients with PD to evaluate the usability of a high-fidelity prototype according to the US Department of Health and Human Services Research-Based Web Design & Usability Guidelines. Qualitative data analysis informed changes to create a final version of IMAGINE-PD.

Results: Qualitative data were reviewed by 3 evaluators. Themes were identified from feedback data of movement disorder specialists and patients with PD in user testing in 3 areas: content such as the topics covered, function such as website navigation, and appearance such as pictures and colors. Similarly, qualitative analysis of usability testing feedback identified additional themes in these 3 areas. Key points of feedback were determined by consensus among reviewers considering the importance of the comment and the frequency of similar comments. Refinements were made to IMAGINE-PD based on consensus recommendations by evaluators within each theme at both user testing and usability testing phases to create a final version of IMAGINE-PD.

Conclusions: User testing for content review and usability testing have informed refinements to IMAGINE-PD to develop this focused, genetics education tool for GBA and LRRK2 testing. Comparison of this stakeholder-informed intervention to standard telegenetic counseling approaches is ongoing.

Citations: 0
Machine Learning for Prediction of Maternal Hemorrhage and Transfusion (Preprint)
Pub Date : 2023-08-22 DOI: 10.2196/52059
H. Ahmadzia, Alexa C Dzienny, Mike Bopf, Jaclyn M Phillips, Jerome Jeffrey Federspiel, Richard Amdur, Madeline Murguia Rice, Laritza Rodriguez

Objectives: To improve postpartum hemorrhage (PPH) prediction and to compare machine learning and traditional statistical methods.

Design: Cross-sectional.

Setting: Deliveries across US hospitals.

Population: Deliveries across 12 US hospitals from the 2002-2008 Consortium for Safe Labor dataset.

Method: We developed models using the Consortium for Safe Labor dataset. Fifty antepartum and intrapartum characteristics and hospital characteristics were included. Logistic regression, support vector machines, multi-layer perceptron, random forest
Citations: 0
Secure Comparisons of Single Nucleotide Polymorphisms Using Secure Multiparty Computation: Method Development.
Pub Date : 2023-07-18 DOI: 10.2196/44700
Andrew Woods, Skyler T Kramer, Dong Xu, Wei Jiang

Background: While genomic variations can provide valuable information for health care and ancestry, the privacy of individual genomic data must be protected. Thus, a secure environment is desirable for a human DNA database such that the total data are queryable but not directly accessible to involved parties (eg, data hosts and hospitals) and that the query results are learned only by the user or authorized party.

Objective: In this study, we provide efficient and secure computations on panels of single nucleotide polymorphisms (SNPs) from genomic sequences as computed under the following set operations: union, intersection, set difference, and symmetric difference.

Methods: Using these operations, we can compute similarity metrics, such as the Jaccard similarity, which could allow querying a DNA database to find the same person and genetic relatives securely. We analyzed various security paradigms and show metrics for the protocols under several security assumptions, such as semihonest, malicious with honest majority, and malicious with a malicious majority.

Results: We show that our methods can be used practically on realistically sized data. Specifically, we can compute the Jaccard similarity of two genomes when considering sets of SNPs, each with 400,000 SNPs, in 2.16 seconds with the assumption of a malicious adversary in an honest majority and 0.36 seconds under a semihonest model.

Conclusions: Our methods may help adopt trusted environments for hosting individual genomic data with end-to-end data security.
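
For orientation, the four set operations and the Jaccard similarity named in this abstract can be written in plaintext as below. This is only a sketch of what is being computed: it is not a secure multiparty protocol and provides no privacy. The function name `jaccard`, the variable names, and the SNP identifiers are hypothetical.

```python
def jaccard(snps_a, snps_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two SNP identifier sets."""
    union = snps_a | snps_b
    if not union:
        # Convention assumed in this sketch: two empty sets are identical
        return 1.0
    return len(snps_a & snps_b) / len(union)


# Made-up SNP identifiers for two individuals
alice = {"rs0001", "rs0002", "rs0003", "rs0004"}
bob = {"rs0003", "rs0004", "rs0005"}

union = alice | bob                    # SNPs present in either set
intersection = alice & bob             # SNPs present in both sets
difference = alice - bob               # SNPs only in alice
symmetric_difference = alice ^ bob     # SNPs in exactly one set
similarity = jaccard(alice, bob)
print(similarity)
```

In the secure setting described above, the same quantities would be computed on protected inputs so that neither party sees the other's SNP set; only the mathematical definition carries over from this sketch.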

Citations: 0
Mutations of SARS-CoV-2 Structural Proteins in the Alpha, Beta, Gamma, and Delta Variants: Bioinformatics Analysis.
Pub Date : 2023-07-14 eCollection Date: 2023-01-01 DOI: 10.2196/43906
Saima Rehman Khetran, Roma Mustafa

Background: COVID-19 and Middle East Respiratory Syndrome are two pandemic respiratory diseases caused by coronavirus species. The novel disease COVID-19 caused by SARS-CoV-2 was first reported in Wuhan, Hubei Province, China, in December 2019, and became a pandemic within 2-3 months, affecting social and economic platforms worldwide. Despite the rapid development of vaccines, there have been obstacles to their distribution, including a lack of fundamental resources, poor immunization, and manual vaccine replication. Several variants of the original Wuhan strain have emerged in the last 3 years, which can pose a further challenge for control and vaccine development.

Objective: The aim of this study was to comprehensively analyze mutations in SARS-CoV-2 variants of concern (VoCs) using a bioinformatics approach toward identifying novel mutations that may be helpful in developing new vaccines by targeting these sites.

Methods: Reference sequences of the SARS-CoV-2 spike (YP_009724390) and nucleocapsid (YP_009724397) proteins were compared to retrieved sequences of isolates of four VoCs from 14 countries for mutational and evolutionary analyses. Multiple sequence alignment was performed and phylogenetic trees were constructed by the neighbor-joining method with 1000 bootstrap replicates using MEGA (version 6). Mutations in amino acid sequences were analyzed using the MultAlin online tool (version 5.4.1).

Results: Among the four VoCs, a total of 143 nonsynonymous mutations and 8 deletions were identified in the spike and nucleocapsid proteins. Multiple sequence alignment and amino acid substitution analysis revealed new mutations, including G72W, M2101I, L139F, 209-211 deletion, G212S, P199L, P67S, I292T, and substitutions with unknown amino acid replacement, reported in Egypt (MW533289), the United Kingdom (MT906649), and other regions. The variants B.1.1.7 (Alpha variant) and B.1.617.2 (Delta variant), characterized by higher transmissibility and lethality, harbored the amino acid substitutions D614G, R203K, and G204R with higher prevalence rates in most sequences. Phylogenetic analysis among the novel SARS-CoV-2 variant proteins and some previously reported β-coronavirus proteins indicated that either the evolutionary clade was weakly supported or not supported at all by the β-coronavirus species.

Conclusions: This study could contribute toward gaining a better understanding of the basic nature of SARS-CoV-2 and its four major variants. The numerous novel mutations detected could also provide a better understanding of VoCs and help in identifying suitable mutations for vaccine targets. Moreover, these data offer evidence for new types of mutations in VoCs, which will provide insight into the epidemiology of SARS-CoV-2.
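
The substitution notation used in the Results (for example, D614G) can be derived from a pairwise alignment of a reference protein and a variant protein. The following is a minimal sketch of that idea, not the MultAlin workflow the authors used. The function name `call_substitutions` and the toy sequences are hypothetical, and the sketch assumes the two sequences are already aligned, with `-` marking gaps.

```python
def call_substitutions(ref_aln: str, query_aln: str) -> list[str]:
    """List differences between two aligned protein sequences.

    Substitutions are reported as <ref residue><ref position><new residue>
    (e.g. 'D614G'); deleted reference residues as <ref residue><position>del.
    Positions are 1-based and count only reference residues, not gap columns.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")

    calls = []
    ref_pos = 0
    for ref_res, query_res in zip(ref_aln, query_aln):
        if ref_res == "-":
            # Insertion relative to the reference: no reference coordinate,
            # so this sketch simply skips the column.
            continue
        ref_pos += 1
        if query_res == "-":
            calls.append(f"{ref_res}{ref_pos}del")
        elif query_res != ref_res:
            calls.append(f"{ref_res}{ref_pos}{query_res}")
    return calls


# Toy example (invented sequences, not SARS-CoV-2 data)
print(call_substitutions("MKDG-L", "MRD-AL"))  # ['K2R', 'G4del']
```

Numbering against the ungapped reference keeps the reported positions comparable across isolates, which is how labels such as R203K and G204R stay stable between studies.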

Citations: 0
Introducing JMIR Bioinformatics and Biotechnology: A Platform for Interdisciplinary Collaboration and Cutting-Edge Research.
Pub Date : 2023-06-12 DOI: 10.2196/48631
Ece Dilber Gamsiz Uzun

JMIR Bioinformatics and Biotechnology supports interdisciplinary research and welcomes contributions that push the boundaries of bioinformatics, genomics, artificial intelligence, and pathology informatics.

Citations: 0
Genomic Insights Into the Evolution and Demographic History of the SARS-CoV-2 Omicron Variant: Population Genomics Approach.
Pub Date : 2023-06-12 eCollection Date: 2023-01-01 DOI: 10.2196/40673
Kritika M Garg, Vinita Lamba, Balaji Chattopadhyay

Background: A thorough understanding of the patterns of genetic subdivision in a pathogen can provide crucial information that is necessary to prevent disease spread. For SARS-CoV-2, the availability of millions of genomes makes this task analytically challenging, and traditional methods for understanding genetic subdivision often fail.

Objective: The aim of our study was to use population genomics methods to identify the subtle subdivisions and demographic history of the Omicron variant, in addition to those captured by the Pango lineage.

Methods: We used a combination of an evolutionary network approach and multivariate statistical protocols to understand the subdivision and spread of the Omicron variant. We identified subdivisions within the BA.1 and BA.2 lineages and further identified the mutations associated with each cluster. We further characterized the overall genomic diversity of the Omicron variant and assessed the selection pressure for each of the genetic clusters identified.

Results: We observed concordant results, using two different methods to understand genetic subdivision. The overall pattern of subdivision in the Omicron variant was in broad agreement with the Pango lineage definition. Further, 1 cluster of the BA.1 lineage and 3 clusters of the BA.2 lineage revealed statistically significant signatures of selection or demographic expansion (Tajima's D < -2), suggesting the role of microevolutionary processes in the spread of the virus.

Conclusions: We provide an easy framework for assessing the genetic structure and demographic history of SARS-CoV-2, which can be particularly useful for understanding the local history of the virus. We identified important mutations that are advantageous to some lineages of Omicron and aid in the transmission of the virus. This is crucial information for policy makers, as preventive measures can be designed to mitigate further spread based on a holistic understanding of the variability of the virus and the evolutionary processes aiding its spread.
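
The Results above report Tajima's D below -2 for several clusters. As a rough illustration of what that statistic measures, the sketch below computes the textbook form of Tajima's D from a set of aligned sequences. It is not the authors' pipeline, whose software is not named in the abstract. The function name `tajimas_d` and the toy alignments are hypothetical, and the sketch assumes equal-length sequences with no gaps or ambiguous bases.

```python
import math
from collections import Counter


def tajimas_d(seqs: list[str]) -> float:
    """Tajima's D for equal-length aligned sequences (textbook formula)."""
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least 4 sequences")
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("sequences must be aligned to equal length")

    seg_sites = 0          # S: number of segregating (variable) sites
    pairwise_diffs = 0.0   # total differences summed over all sequence pairs
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # differing pairs = all ordered pairs minus identical ones, halved
            pairwise_diffs += (n * n - sum(c * c for c in counts.values())) / 2
    if seg_sites == 0:
        raise ValueError("no segregating sites; D is undefined")

    pi = pairwise_diffs / (n * (n - 1) / 2)  # mean pairwise difference

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Toy alignments (invented): an excess of rare variants gives D < 0,
# intermediate-frequency variants give D > 0.
rare = ["AAAA", "AAAT", "AATA", "ATAA"]
balanced = ["AA", "AA", "TT", "TT"]
print(tajimas_d(rare) < 0, tajimas_d(balanced) > 0)
```

Strongly negative values indicate more low-frequency variants than expected under neutrality and constant population size, which is why the abstract reads D < -2 as a signature of selection or demographic expansion.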

Citations: 0
Decision of the Optimal Rank of a Nonnegative Matrix Factorization Model for Gene Expression Data Sets Utilizing the Unit Invariant Knee Method: Development and Evaluation of the Elbow Method for Rank Selection.
Pub Date : 2023-06-06 DOI: 10.2196/43665
Emine Guven

Background: There is a great need to develop a computational approach to analyze and exploit the information contained in gene expression data. The recent use of nonnegative matrix factorization (NMF) in computational biology has demonstrated its capability to derive essential details from large amounts of data, in particular gene expression microarrays. A common problem in NMF is finding the proper rank (r), that is, the number of factors in the reduced representation, but no agreement exists on which technique is most appropriate for this purpose. Thus, various techniques have been suggested to select the optimal value of the factorization rank (r).

Objective: In this work, a new metric for rank selection is proposed based on the elbow method, which was methodically compared against the cophenetic metric.

Methods: To determine the optimal rank (r), this study applied the unit invariant knee (UIK) method to NMF of gene expression data sets. Because the UIK method relies on an extremum distance estimator to locate the inflection point and identify a knee point, the proposed method finds the first inflection point of the residual sum of squares curve of the NMF algorithms, with the gene expression data set as the target matrix.

Results: Computation was conducted for the UIK task using gene expression data of acute lymphoblastic leukemia and acute myeloid leukemia samples. The distinct NMF results were then compared across different algorithms. The proposed UIK method is easy to perform, fast, free of a priori rank value input, and does not require initial parameters that significantly influence the model's functionality.

Conclusions: This study demonstrates that the elbow method provides a credible prediction both for gene expression data and for precisely estimating simulated mutational processes data with known dimensions. The proposed UIK method is faster than conventional methods, including metrics utilizing the consensus matrix as a criterion for rank selection, while achieving significantly better computational efficiency without visual inspection of the curves. Finally, the suggested rank tuning method based on the elbow method for gene expression data is arguably theoretically superior to the cophenetic measure.
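
To make the elbow idea in this abstract concrete, the sketch below picks a rank from a curve of residual sum of squares (RSS) versus rank without visual inspection. It is a simplified stand-in, not the UIK method or its extremum distance estimator: it selects the point farthest from the straight line joining the first and last points of the rescaled curve. The function name `elbow_rank` and the RSS values are hypothetical.

```python
import math


def elbow_rank(ranks: list[int], rss: list[float]) -> int:
    """Return the rank at the elbow of an RSS-versus-rank curve.

    Heuristic: rescale both axes to [0, 1], draw the chord from the first
    to the last point, and pick the point with the largest perpendicular
    distance from that chord.
    """
    if len(ranks) != len(rss) or len(ranks) < 3:
        raise ValueError("need at least 3 (rank, RSS) pairs of equal length")
    x_span = ranks[-1] - ranks[0]
    y_min, y_max = min(rss), max(rss)
    if x_span == 0 or y_max == y_min:
        raise ValueError("curve is degenerate")

    # Rescaling makes the choice independent of the units of RSS.
    xs = [(r - ranks[0]) / x_span for r in ranks]
    ys = [(v - y_min) / (y_max - y_min) for v in rss]

    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    chord_length = math.hypot(dx, dy)
    distances = [
        abs(dy * (x - xs[0]) - dx * (y - ys[0])) / chord_length
        for x, y in zip(xs, ys)
    ]
    best = max(range(len(ranks)), key=distances.__getitem__)
    return ranks[best]


# Hypothetical RSS values for NMF fits at ranks 2..8 (not from the study)
ranks = [2, 3, 4, 5, 6, 7, 8]
rss = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0, 19.0]
print(elbow_rank(ranks, rss))  # 4
```

In practice the RSS values would come from fitting NMF at each candidate rank; the selection step itself needs no prior rank and no tuning parameters, which is the property the abstract emphasizes.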

Citations: 0