An Effective Algorithm Based on Sequence and Property Information for N4-methylcytosine Identification in Multiple Species

Letters in Organic Chemistry (IF 0.7; Region 4, Chemistry; JCR Q4, CHEMISTRY, ORGANIC) | Pub Date: 2024-01-26 | DOI: 10.2174/0115701786277281231228093405
Lichao Zhang, Xueting Wang, Kang Xiao, Liang Kong
Citations: 0

Abstract

N4-methylcytosine (4mC) is one of the most important epigenetic modifications; it plays a significant role in biological processes and helps explain biological functions. Although biological experiments can identify potential 4mC sites, they are limited by the experimental environment and are labor-intensive. It is therefore crucial to construct a computational model to identify 4mC sites. Several computational methods have been proposed for this task, but two problems remain: (1) a more accurate algorithm is required to improve prediction, especially in terms of the Matthews correlation coefficient (MCC); (2) a simpler method is needed for clinical research aimed at designing medicines or treating disease. With these aspects in mind, this study proposes an effective algorithm that uses comprehensible encoding and applies to multiple species. Because the arrangement of nucleotides and their property information can reflect sequence structure and function, several feature vectors were developed based on nucleotide energy information, trinucleotide energy information, and nucleotide chemical property information. In addition, the effect of the features was analyzed to select the optimal feature vectors for the multiple species. Finally, the optimal feature vectors were fed into the CatBoost algorithm to construct the identification model. The evaluation results showed that the study obtained the highest MCC, i.e., 2.5%~11.1%, 1.4%~17.8%, 1.1%~7.6%, and 2.3%~18.0% higher than previous models for the A. thaliana, C. elegans, D. melanogaster, and E. coli datasets, respectively. These results indicate that the proposed method can identify 4mC sites in multiple species, with particular gains in MCC. It could provide a reasonable supplement for biological research.
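To make two of the ingredients named in the abstract concrete, the following is a minimal sketch of a nucleotide chemical property encoding and of the MCC metric. The abstract does not give the authors' exact encoding, so the three-bit mapping below (ring structure, functional group, hydrogen-bond strength) is the scheme commonly used in the literature and is an assumption here; the function names `encode_ncp` and `mcc` are hypothetical and chosen only for illustration. The energy-based features are not sketched because the abstract does not specify their values.

```python
import math

# Assumed 3-bit chemical property code per base (not taken from the abstract):
# (purine=1 / pyrimidine=0, amino=1 / keto=0, weak H-bond=1 / strong H-bond=0)
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def encode_ncp(seq):
    """Flatten a DNA sequence into a list with three property bits per base."""
    features = []
    for base in seq.upper():
        if base not in NCP:
            raise ValueError(f"unexpected base: {base!r}")
        features.extend(NCP[base])
    return features


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = 4mC site, 0 = not)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

In a pipeline like the one the abstract describes, vectors such as `encode_ncp(seq)` would be stacked into a feature matrix and passed to a gradient-boosting classifier (the study uses CatBoost), and `mcc` would be computed on held-out predictions. MCC ranges from -1 to 1, and uses all four cells of the confusion matrix.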
Source journal: Letters in Organic Chemistry (Chemistry, Organic Chemistry)
CiteScore: 1.30
Self-citation rate: 12.50%
Articles published: 135
Review time: 7 months
About the journal (Aims & Scope): Letters in Organic Chemistry publishes original letters (short articles), research articles, mini-reviews, and thematic issues based on mini-reviews and short articles, in all areas of organic chemistry, including synthesis, bioorganic, medicinal, natural products, organometallic, supramolecular, molecular recognition, and physical organic chemistry. The emphasis is on publishing quality papers rapidly by taking full advantage of the latest technology for both submission and review of manuscripts. The journal is essential reading for organic chemists in both academia and industry.