Pub Date: 2024-08-21 | DOI: 10.1109/TCBB.2024.3447273
Yufei Li;Xiaoyong Ma;Xiangyu Zhou;Penghzhen Cheng;Kai He;Tieliang Gong;Chen Li
Biomedical coreference resolution focuses on identifying coreferences in biomedical texts and normally consists of two parts: (i) mention detection, which identifies textual representations of biological entities, and (ii) finding the coreference links between them. Recently, a popular way to enhance the task is to embed a knowledge base into deep neural networks. However, the way these methods integrate knowledge leads to a shortcoming: such knowledge may play a larger role in mention detection than in coreference resolution. Specifically, they tend to integrate knowledge prior to mention detection, as part of the embeddings. Moreover, they primarily focus on mention-dependent knowledge (KBase), i.e., knowledge entities directly related to mentions, while ignoring the correlated knowledge (K+) between the mentions in a mention pair. For mentions with significant differences in word form, this may limit their ability to extract potential correlations between those mentions. This paper therefore develops a novel model that integrates both KBase and K+ entities and achieves state-of-the-art performance on the BioNLP and CRAFT-CR datasets. Empirical studies on mention detection with different mention lengths reveal the effectiveness of the KBase entities. The evaluation on cross-sentence and match/mismatch coreference further demonstrates the superiority of the K+ entities in extracting potential background correlations between mentions.
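The distinction between KBase and K+ knowledge can be illustrated with a toy mention-pair scorer. This is a minimal sketch under assumed names, not the paper's model: `jaccard` measures direct KBase overlap, while a K+ bonus fires when a correlated entity links to both mentions, even if their surface forms differ.

```python
# Toy sketch (not the paper's architecture): scoring a mention pair using
# mention-dependent knowledge (KBase) plus correlated knowledge (K+).
# All entity sets and weights below are hypothetical illustrations.

def jaccard(a, b):
    """Overlap between two KBase entity sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def pair_score(kbase_i, kbase_j, k_plus, w_base=0.5, w_plus=0.5):
    """Score a mention pair: KBase overlap captures direct entity links;
    the K+ term rewards pairs whose mentions share a correlated entity,
    helping mentions with very different word forms."""
    base = jaccard(kbase_i, kbase_j)
    plus = 1.0 if (k_plus & kbase_i) and (k_plus & kbase_j) else 0.0
    return w_base * base + w_plus * plus

# Two mentions with different surface forms but shared correlated knowledge:
m1 = {"TP53", "tumor_suppressor"}
m2 = {"p53_protein", "tumor_suppressor"}
k_plus = {"tumor_suppressor"}
score = pair_score(m1, m2, k_plus)
```

Without the K+ term this pair would score only the weak KBase overlap; the shared correlated entity lifts it, mirroring the abstract's argument for mentions with dissimilar word forms.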
Title: Integrating K+ Entities Into Coreference Resolution on Biomedical Texts. IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 21, no. 6, pp. 2145-2155.
Pub Date: 2024-08-19 | DOI: 10.1109/TCBB.2024.3433378
Jeremie S Kim, Can Firtina, Meryem Banu Cavlak, Damla Senol Cali, Nastaran Hajinazar, Mohammed Alser, Can Alkan, Onur Mutlu
AirLift is the first read remapping tool that enables users to quickly and comprehensively map a read set that was previously mapped to one reference genome onto another, similar reference. Users can then quickly run downstream analyses of read sets for each new reference release. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces the overall execution time to remap read sets between two reference genome versions by up to 27.4×. We validate our remapping results with GATK and find that AirLift provides high accuracy in identifying ground-truth SNP/INDEL variants.
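The core intuition behind remapping rather than full mapping can be sketched in a few lines. This is a hedged toy, not AirLift's implementation: reads that fall in regions unchanged between the two references are moved by a simple coordinate shift, and only reads in updated regions fall back to a full aligner. The region table below is hypothetical.

```python
# Toy sketch of the remapping idea (not AirLift's actual algorithm):
# constant regions of the old reference translate directly to the new
# reference via an offset; reads elsewhere need full realignment.

# Hypothetical table: (old_start, old_end, offset_to_new_reference).
CONSTANT_REGIONS = [(0, 1000, 0), (1500, 4000, 120)]

def remap(read_pos):
    """Return the read's position on the new reference, or None when the
    read lies in an updated region and must go to the full mapper."""
    for start, end, offset in CONSTANT_REGIONS:
        if start <= read_pos < end:
            return read_pos + offset
    return None  # fall back to full mapping for this read
```

Because most of a genome is typically unchanged between consecutive reference releases, the cheap translation path handles the bulk of the reads, which is where the large end-to-end speedup over full mapping comes from.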
Title: AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes. IEEE/ACM Transactions on Computational Biology and Bioinformatics, early access.
A transcription factor (TF) is a sequence-specific DNA-binding protein that plays key roles in cell-fate decisions by regulating gene expression. Predicting TFs is important for the tea plant research community, as TFs regulate gene expression and thereby influence plant growth, development, and stress responses. Identifying them through wet-lab experimental validation is challenging because of their rarity and the high cost and time required. As a result, computational methods are an increasingly popular choice. The pre-training strategy has been applied to many tasks in natural language processing (NLP) and has achieved impressive performance. In this paper, we present a novel recognition algorithm named TeaTFactor that utilizes pre-training for TF prediction. The model is built upon the BERT architecture and is initially pre-trained on protein data from UniProt. Subsequently, the model is fine-tuned on the collected TF data of tea plants. We evaluated four different word segmentation methods and the existing state-of-the-art prediction tools. According to the comprehensive experimental results and a case study, our model is superior to existing models and achieves the goal of accurate identification. In addition, we have developed a web server at http://teatfactor.tlds.cc
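The abstract mentions evaluating four word segmentation methods for protein sequences. One common family of such schemes, sketched below as an illustrative assumption (the abstract does not specify TeaTFactor's tokenizer), splits an amino-acid sequence into k-mer "words" for a BERT-style vocabulary, with the stride controlling whether the k-mers overlap.

```python
# Hedged sketch of k-mer word segmentation for protein sequences, one
# plausible scheme a BERT-style model might use; TeaTFactor's actual
# tokenization is not specified here.

def kmer_tokens(seq, k=3, stride=1):
    """Split an amino-acid sequence into k-mer 'words'.
    stride == k gives non-overlapping words; stride == 1 gives a
    sliding window with overlapping words."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# Non-overlapping 3-mers over a short (hypothetical) peptide:
toks = kmer_tokens("MKTAYIAK", k=3, stride=3)
```

Choices like k and stride trade vocabulary size against sequence length, which is exactly the kind of difference a comparison of segmentation methods would measure.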