Hist2Vec: A histogram and kernel-based embedding method for molecular sequence analysis
Sarwan Ali, Tamkanat E. Ali, Haris Mansoor, Prakash Chourasia, Murray Patterson
Expert Systems with Applications, Volume 273, Article 126859. Published 2025-02-18.
DOI: 10.1016/j.eswa.2025.126859
https://www.sciencedirect.com/science/article/pii/S0957417425004816
Abstract
Due to the huge surge in genomic data, there is an increasing need for better and more efficient molecular sequence classification techniques. Researchers have proposed many machine learning models that give promising classification results. However, these models face a few limitations in capturing hierarchical structures and relationships in molecular sequences. To overcome such limitations, we propose Hist2Vec, a novel kernel-based technique for embedding generation that captures sequence similarities by constructing histogram-based kernel matrices using Gaussian kernel functions. By building histogram-based representations from the distinct k-mers and minimizers found in each sequence, Hist2Vec is able to identify similarities between sequences. The sequence information is preserved by mapping these representations to higher-dimensional feature spaces using Gaussian kernel functions. We then apply kernel Principal Component Analysis to obtain the final embeddings of the molecular sequences. These embeddings are then used as input to classical machine learning models for supervised analysis. We also establish the theoretical properties of Hist2Vec, ensuring the validity and effectiveness of the method. The experimental evaluation shows that Hist2Vec outperforms all other state-of-the-art methods, demonstrating high accuracy of >76% for the Human DNA dataset and >83% for the Coronavirus Host dataset, and high precision on the t-Cell dataset.
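The pipeline outlined in the abstract (histogram features, Gaussian kernel, kernel PCA) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it uses k-mer counts only and leaves out minimizers, and the function names, the toy sequences, and the values of `k`, `sigma`, and `n_components` are all assumptions chosen for the example.

```python
from itertools import product

import numpy as np


def kmer_histogram(seq, k=3, alphabet="ACGT"):
    """Count how often each possible k-mer over the alphabet occurs in seq."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    hist = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing characters outside the alphabet
            hist[j] += 1
    return hist


def gaussian_kernel(X, sigma=1.0):
    """Pairwise Gaussian kernel K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def kernel_pca(K, n_components=2):
    """Embed the training points via eigendecomposition of the centered kernel."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n  # center in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    # Projection of point i on component c is sqrt(lambda_c) * v_c[i].
    return eigvecs[:, order] * np.sqrt(top_vals)


# Hypothetical toy sequences, for illustration only.
sequences = ["ACGTACGTAC", "ACGTACGTTT", "GGGGCCCCGG", "GGGCCCCGGG"]
H = np.array([kmer_histogram(s, k=3) for s in sequences])
K = gaussian_kernel(H, sigma=2.0)
embeddings = kernel_pca(K, n_components=2)
print(embeddings.shape)
```

Each row of `embeddings` is a fixed-length vector for one sequence and could be passed to any classical classifier. In practice, scikit-learn's `KernelPCA` with `kernel="precomputed"` is a reasonable replacement for the hand-written eigendecomposition shown here.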
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.