Using Natural Language Processing Methods to Predict Topics Included in 2019 Ohio Syphilis Disease Intervention Specialist Records

Sexually Transmitted Diseases (IF 1.7; Zone 4, Medicine; JCR Q3, Infectious Diseases). Pub Date: 2025-06-01. Epub Date: 2025-02-11. DOI: 10.1097/OLQ.0000000000002135

Payal Chakraborty, Xia Ning, Mary McNeill, David M Kline, Abigail B Shoben, William C Miller, Abigail Norris Turner

Pages: 356-363. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12064372/pdf/

Citations: 0

Abstract

Using Natural Language Processing Methods to Predict Topics Included in 2019 Ohio Syphilis Disease Intervention Specialist Records.

Background: Free-text notes in disease intervention specialist (DIS) records may contain relevant information for sexually transmitted infection control. In their current form, the notes are not analyzable without manual reading, which is labor-intensive and prone to error.

Methods: We used natural language processing methods to analyze 2019 Ohio DIS syphilis records with nonmissing notes (n = 1987). We identified 21 topics relevant for transmission and case investigations. We manually coded these records to create "gold standard" labels for each topic (0 = topic not present, 1 = topic present), then trained machine learning models to identify the topics in the text. For models to analyze text data, the text must be converted to numbers. We explored 2 approaches to numerically represent words: (1) term frequency-inverse document frequency (TF-IDF), which measures the importance of words based on how many times they appear in a record and in the dataset as a whole, and (2) GloVe embeddings, which are numerical vectors that were developed by researchers for each word in the English language to encode its semantic meaning. We explored 3 types of statistical models (naive Bayes, support vector machine [SVM], and logistic regression) using TF-IDF, and 1 type of neural network model (long short-term memory [LSTM] model) using GloVe. All models were used for binary prediction (i.e., topic not present, topic present).
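The abstract describes TF-IDF only in words and does not state the exact formula or software the authors used. As a rough sketch, assuming the common tf × log(N/df) weighting and simple whitespace tokenization (both assumptions, as are the function name `tfidf` and the invented toy notes), the step of converting notes to numbers might look like this:

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df).

    This is one common TF-IDF variant, not necessarily the one used in the paper.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Term frequency within the note, scaled by rarity across notes.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy notes, invented for illustration only (not real DIS records).
notes = [
    "patient reports substance use",
    "patient missed appointment",
    "patient reports new partner",
]
vectors = tfidf(notes)
print(vectors[0])
```

In this toy set, "patient" appears in every note and receives a weight of 0, while "substance" and "use" appear in only one note and receive the highest weights in the first note. Vectors like these could then be passed to a binary classifier, one per topic.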

Results: For most topics, the LSTM model performed the best overall in identifying topics, and the support vector machine model performed the best among the statistical models. For example, the LSTM model predicted the topic "substance use" with high accuracy (97%), sensitivity (92%), and specificity (98%). No model performed well for uncommon topics (e.g., "alcohol use" or "delays in care").
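The three reported metrics all derive from the counts of true and false positives and negatives. The sketch below, which uses a hypothetical helper `binary_metrics` and invented labels rather than anything from the study, shows how they are computed and illustrates one general reason uncommon topics are hard to evaluate: accuracy can stay high even when no positive case is found. The abstract itself does not say why the models struggled on uncommon topics.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity: share of records with the topic that the model found.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: share of records without the topic correctly ruled out.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical rare topic: present in 2 of 20 records (invented labels).
y_true = [1, 1] + [0] * 18
# A model that never predicts the topic.
y_pred = [0] * 20
m = binary_metrics(y_true, y_pred)
print(m)
```

Here a model that never predicts the topic still reaches 90% accuracy, with 0% sensitivity and 100% specificity, which is why sensitivity and specificity are worth reading alongside accuracy.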

Conclusions: Machine learning models performed well in identifying some topics in 2019 Ohio syphilis records. This analysis is a first step in applying natural language processing methods to make DIS notes more accessible for analysis.

Source journal

Sexually Transmitted Diseases (Medicine: Infectious Diseases)

CiteScore: 4.00
Self-citation rate: 16.10%
Articles published: 289
Review time: 3-8 weeks

Journal description: Sexually Transmitted Diseases, the official journal of the American Sexually Transmitted Diseases Association, publishes peer-reviewed, original articles on clinical, laboratory, immunologic, epidemiologic, behavioral, public health, and historical topics pertaining to sexually transmitted diseases and related fields. Reports from the CDC and NIH provide up-to-the-minute information. A highly respected editorial board is composed of prominent scientists who are leaders in this rapidly changing field. Included in each issue are studies and developments from around the world.