Hist2Vec: A histogram and kernel-based embedding method for molecular sequence analysis

IF 7.5 · Zone 1, Computer Science · Q1 Computer Science, Artificial Intelligence · Expert Systems with Applications · Pub Date: 2025-02-18 · DOI: 10.1016/j.eswa.2025.126859

Sarwan Ali, Tamkanat E. Ali, Haris Mansoor, Prakash Chourasia, Murray Patterson

Expert Systems with Applications, Volume 273, Article 126859. Journal Article. https://www.sciencedirect.com/science/article/pii/S0957417425004816


Abstract

The surge in genomic data has increased the need for better and more efficient molecular sequence classification techniques. Researchers have proposed many machine learning models that achieve promising classification results. However, these models face several limitations in capturing hierarchical structures and relationships in molecular sequences. To overcome these limitations, we propose Hist2Vec, a novel kernel-based embedding technique that captures sequence similarities by constructing histogram-based kernel matrices and Gaussian kernel functions. By building histogram-based representations from the distinct k-mers and minimizers found in each sequence, Hist2Vec identifies similarities between sequences. The sequence information is preserved by mapping these representations to higher-dimensional feature spaces using Gaussian kernel functions. We then apply kernel Principal Component Analysis to obtain the final embedding of each molecular sequence. These embeddings are used as input to classical machine learning models for supervised analysis. We also establish the theoretical properties of Hist2Vec, ensuring the validity and effectiveness of the method. The experimental evaluation shows that Hist2Vec outperforms all other state-of-the-art methods, with accuracy of > 76% on the Human DNA dataset and > 83% on the Coronavirus Host dataset, and high precision on the t-Cell dataset.
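The first stages of the pipeline described in the abstract (k-mers, minimizers, histograms, Gaussian kernel) can be illustrated with a minimal sketch. This is not the authors' implementation. Several details are assumptions made here for illustration only: a minimizer is taken to be the lexicographically smallest m-mer inside each k-mer, the histogram is a normalized count vector over the minimizers observed in the input, and the function names and the parameters `k`, `m`, and `sigma` are hypothetical.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k, m):
    """For each k-mer, keep its lexicographically smallest m-mer.

    Assumption: this simple forward-only definition stands in for
    whatever minimizer definition the paper actually uses.
    """
    return [min(kmers(kmer, m)) for kmer in kmers(seq, k)]


def histogram(tokens, vocab):
    """Normalized count of each vocabulary entry among the tokens."""
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in vocab]


def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2)) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def kernel_matrix(seqs, k=4, m=2, sigma=1.0):
    """Pairwise Gaussian kernel values between minimizer histograms."""
    tokens = [minimizers(s, k, m) for s in seqs]
    # Assumption: the vocabulary is the set of minimizers seen in the input.
    vocab = sorted({t for ts in tokens for t in ts})
    hists = [histogram(ts, vocab) for ts in tokens]
    return [[gaussian_kernel(a, b, sigma) for b in hists] for a in hists]


# Toy sequences, invented for illustration (not from any dataset in the paper).
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
K = kernel_matrix(seqs)
```

In this toy example the first two sequences share most of their minimizers, so `K[0][1]` is larger than `K[0][2]`. A matrix like `K` could then be passed to a kernel PCA routine that accepts precomputed kernels (for example scikit-learn's `KernelPCA(kernel="precomputed")`) to obtain low-dimensional embeddings for a downstream classifier.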


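Source journal: Expert Systems with Applications (category: Engineering and Technology, Engineering: Electrical and Electronic)

CiteScore: 13.80

Self-citation rate: 10.60%

Articles published: 2045

Review time: 8.7 months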

Journal description: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.