{"title":"Introduction to the Special Issue on Computational Methods for Biomedical NLP","authors":"M. Devarakonda, E. Voorhees","doi":"10.1145/3492302","DOIUrl":null,"url":null,"abstract":"It is now well established that biomedical text requires methods targeted for the domain. Developments in deep learning and a series of successful shared challenges have contributed to a steady progress in techniques for natural language processing of biomedical text. Contributing to this on-going progress and particularly focusing on computational methods, this special issue was created to encourage research in novel approaches for analyzing biomedical text. The six papers selected for the issue offer a diversity of novel methods that leverage biomedical text for research and clinical uses. A well-established practice in pretraining deep learning models for biomedical applications has been to adopt a most promising model that was already pretrained on general domain natural language corpus and then “add” additional pre-training with biomedical corpora. In “Domain-specific language model pretraining for biomedical natural language processing”, Gu et al. successfully challenge this approach. The authors conducted an experiment where multiple standard benchmarks were used to compare a model that was pre-trained entirely and only on biomedical corpus with models that were pretrained using the “add” on approach. Results showed an impressive improvement in favor of pretraining only with biomedical corpus. The study provides an excellent data-point in support of clarity in model training rather than accumulation. Tariq et al. also find using domain-aware tokenization and embeddings to be more effective in their paper “Bridging the Gap Between Structured and Free-form Radiology Reporting: A Case-study on Coronary CT Angiography”. They compare a variety of models constructed to predict the severity of cardiovascular disease from the language used within free-text radiology reports. 
Models that used medical-domain-aware tokenization and word embeddings of the reports were consistently more effective than raw word-based models. The better models are able to accurately predict disease severity under real-world conditions of diverse terminology from different radiologists and unbalanced class sizes.

Two papers address the problem of maintaining the privacy of clinical documents, though from widely different perspectives. De-identification is the most widely used approach to eliminating PHI (Protected Health Information) from clinical documents before making the data available to NLP researchers. In "A Context-enhanced De-identification System", Lee et al. describe an improved de-identification technique for clinical records. Their context-enhanced de-identification system, called CEDI, uses attention mechanisms in a long short-term memory (LSTM) network to capture the appropriate context. This context allows the system to detect dependencies that cross sentence boundaries, an important feature since clinical reports often contain such dependencies. Nonetheless, accurate and broad-coverage de-identification of unstructured data remains challenging, and lack of trust in the de-identification process can be a serious limiting factor for data release.

In "Differentially Private Medical Texts Generation using Generative Neural Networks", Aziz et al. take a different approach to the privacy of clinical documents. They propose highly accurate synthetic generation of clinical documents as a practical alternative.
Their method uses self-attention-based neural networks and differential privacy (i.e., the ability to control the level of privacy relative to the original documents).
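Aziz et al. apply differential privacy to the training of generative networks; as a minimal, self-contained sketch of the underlying privacy guarantee, the classic Laplace mechanism below releases a count over a toy record set (the corpus, function names, and query are hypothetical). The privacy parameter epsilon controls the trade-off the abstract describes: smaller epsilon means stronger privacy and noisier output.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5
    # max() guards against log(0) in the (measure-zero) edge case u == -0.5.
    return -scale * math.copysign(1.0, u) * math.log(max(1e-12, 1.0 - 2.0 * abs(u)))

def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon gives epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Toy clinical corpus: 100 records, 50 of which match the query.
records = ["diabetes", "asthma", "diabetes", "hypertension"] * 25
query = lambda r: r == "diabetes"

rng = random.Random(0)
for eps in (0.1, 1.0, 10.0):
    noisy = private_count(records, query, eps, rng)
    print(f"epsilon={eps:>4}: noisy count ~ {noisy:.1f} (true = 50)")
```

Differentially private model training (e.g., DP-SGD) applies the same calibrated-noise idea to gradient updates rather than to query answers, which is what allows a generative model to emit synthetic documents with a quantifiable bound on what they reveal about any single source record.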