Title: A Short Survey of LSTM Models for De-identification of Medical Free Text
Authors: Joffrey L. Leevy, T. Khoshgoftaar
Published in: 2020 IEEE 6th International Conference on Collaboration and Internet Computing (CIC)
Publication date: 2020-12-01
DOI: 10.1109/CIC50333.2020.00023 (https://doi.org/10.1109/CIC50333.2020.00023)
Citations: 1
Abstract
The confidentiality of patient information is protected by government regulations in various countries, such as the Health Insurance Portability and Accountability Act (HIPAA) in the USA. Under these laws, adequate protections must be in place to safeguard patients' health records, which are often big data composed of free text. Machine learning approaches are used extensively for the automated de-identification of medical free text, and several studies that incorporate long short-term memory (LSTM) networks, a variant of the recurrent neural network (RNN) architecture, report outstanding results. Our survey covers LSTM models from the past five years, and its findings make an appreciable contribution. Performance-wise, LSTMs generally surpassed the other types of models used in automated de-identification of free text, namely conditional random field (CRF) algorithms and rule-based algorithms. In addition, hybrid or ensemble LSTM models did not outperform LSTM-only models. Finally, we note that the customization of gold-standard de-identification datasets may result in overfitted models.