{"title":"样本量和疾病流行对叙事数据监督式机器学习的影响。","authors":"Lawrence K McKnight, Adam Wilcox, George Hripcsak","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>This paper examines the independent effects of outcome prevalence and training sample sizes on inductive learning performance. We trained 3 inductive learning algorithms (MC4, IB, and Naïve-Bayes) on 60 simulated datasets of parsed radiology text reports labeled with 6 disease states. Data sets were constructed to define positive outcome states at 4 prevalence rates (1, 5, 10, 25, and 50%) in training set sizes of 200 and 2,000 cases. We found that the effect of outcome prevalence is significant when outcome classes drop below 10% of cases. The effect appeared independent of sample size, induction algorithm used, or class label. Work is needed to identify methods of improving classifier performance when output classes are rare.</p>","PeriodicalId":79712,"journal":{"name":"Proceedings. AMIA Symposium","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2002-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2244149/pdf/procamiasymp00001-0560.pdf","citationCount":"0","resultStr":"{\"title\":\"The effect of sample size and disease prevalence on supervised machine learning of narrative data.\",\"authors\":\"Lawrence K McKnight, Adam Wilcox, George Hripcsak\",\"doi\":\"\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>This paper examines the independent effects of outcome prevalence and training sample sizes on inductive learning performance. We trained 3 inductive learning algorithms (MC4, IB, and Naïve-Bayes) on 60 simulated datasets of parsed radiology text reports labeled with 6 disease states. Data sets were constructed to define positive outcome states at 4 prevalence rates (1, 5, 10, 25, and 50%) in training set sizes of 200 and 2,000 cases. We found that the effect of outcome prevalence is significant when outcome classes drop below 10% of cases. The effect appeared independent of sample size, induction algorithm used, or class label. Work is needed to identify methods of improving classifier performance when output classes are rare.</p>\",\"PeriodicalId\":79712,\"journal\":{\"name\":\"Proceedings. AMIA Symposium\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2002-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2244149/pdf/procamiasymp00001-0560.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings. AMIA Symposium\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. AMIA Symposium","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The effect of sample size and disease prevalence on supervised machine learning of narrative data.
This paper examines the independent effects of outcome prevalence and training sample sizes on inductive learning performance. We trained 3 inductive learning algorithms (MC4, IB, and Naïve-Bayes) on 60 simulated datasets of parsed radiology text reports labeled with 6 disease states. Data sets were constructed to define positive outcome states at 4 prevalence rates (1, 5, 10, 25, and 50%) in training set sizes of 200 and 2,000 cases. We found that the effect of outcome prevalence is significant when outcome classes drop below 10% of cases. The effect appeared independent of sample size, induction algorithm used, or class label. Work is needed to identify methods of improving classifier performance when output classes are rare.