CpG Island Detection Using Transformer Model with Conditional Random Field
Md Jubaer Hossain, M. Bhuiyan, Z. Abdullah
2022 IEEE Bombay Section Signature Conference (IBSSC), published 2022-12-08
DOI: 10.1109/IBSSC56953.2022.10037492 (https://doi.org/10.1109/IBSSC56953.2022.10037492)
Citations: 0
Abstract
Detecting potential locations of CpG islands is one of the first steps in predicting the promoter regions of many housekeeping and tissue-specific genes, which in turn helps identify many epigenetic causes of cancer. Traditionally, finding potential CpG islands computationally involves calculating many hand-crafted features and making various assumptions. Recently, in Natural Language Processing (NLP), transformer architectures incorporating multi-head attention have surpassed many other sequence-processing architectures, such as RNN, GRU, and LSTM, in terms of accuracy, speed, and computational efficiency. One major NLP task is Named Entity Recognition (NER), which extracts the relevant information from a long sequence. In this study, CpG island identification is framed as an NER problem, and a transformer architecture is used for detection. A conditional random field is further incorporated to capture the dependencies between the associated labels. An additional attention mask is included on the input layer to give more weight to the regions relevant to the DNA sequence. The publicly available EMBL human DNA database is used for the experiments. More than 96% accuracy and a 73% F1-score are achieved, which is superior to the existing results in the literature. The proposed approach can be used to efficiently identify biomarkers for various important and disease-related genes. In addition, it may be used for other genome sequence analysis and processing tasks.
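To make the contrast between the two framings concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that the hand-crafted features include GC content and the observed/expected CpG ratio, two quantities commonly used in classical CpG island criteria, and that the NER framing uses per-base BIO tags; the abstract specifies neither. The function names and the `B-CpG`/`I-CpG`/`O` tag names are hypothetical.

```python
from typing import List, Tuple


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def bio_labels(length: int, islands: List[Tuple[int, int]]) -> List[str]:
    """Per-base BIO tags for half-open island intervals [start, end).

    Assumed labeling scheme (hypothetical tag names): the first base of an
    island is tagged B-CpG, the remaining island bases I-CpG, all others O.
    """
    labels = ["O"] * length
    for start, end in islands:
        if not 0 <= start < end <= length:
            raise ValueError(f"interval ({start}, {end}) outside sequence")
        labels[start] = "B-CpG"
        for i in range(start + 1, end):
            labels[i] = "I-CpG"
    return labels
```

In this framing, a sequence model would predict one tag per base (or per token) in place of thresholding window features, and a conditional random field layer could then score transitions between adjacent tags, for example penalizing an `I-CpG` that directly follows an `O`.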