Clustering polycystic ovary syndrome laboratory results extracted from a large internet forum with machine learning
Rebecca H.K. Emanuel, Paul D. Docherty, Helen Lunt, Rua Murray, Rebecca E. Campbell
Intelligence-based medicine, volume 9, Article 100135. Published 2024-01-01. DOI: 10.1016/j.ibmed.2024.100135
https://www.sciencedirect.com/science/article/pii/S2666521224000024
Abstract
Background
Polycystic Ovary Syndrome (PCOS) is reported to affect between 4% and 21% of reproductive-aged people with ovaries. It is a heterogeneous condition that lacks established phenotypes covering the range of reproductive and metabolic features present in PCOS. These features may lead patients to undergo a variety of relevant laboratory tests. Previous work gathered laboratory test results from a PCOS-specific forum hosted on the website reddit.
Objectives
In this paper, laboratory results and body mass index (BMI) values posted on the PCOS reddit forum were clustered to demonstrate the usefulness of the forum for PCOS research and to validate existing PCOS phenotypes or discover other appropriate phenotypes.
Methods and results
Over 1500 sets of PCOS-related reddit laboratory test results and BMIs were clustered using nearest-neighbour imputation and K-means clustering. However, only non-imputed data were included in the final clusters. Kernel Density Estimation plots were used to display the distinct clusters. The clustered test results suggested the existence of distinct metabolic and reproductive phenotypes, as well as a group displaying mild features of both types of dysregulation and a group skewed towards normal results. It was also possible to separate out distinct hypothyroid groups within the mixed dysregulation group and to separate insulin-resistant and diabetes-like groups within the metabolic group.
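The pipeline described above (nearest-neighbour imputation, K-means clustering, then reporting only non-imputed values) can be illustrated with a minimal sketch. This is not the authors' code. The tiny synthetic dataset, the single-donor imputation rule, the distance over shared features, the farthest-point initialisation, and the choice of two clusters are all assumptions made for illustration. Kernel Density Estimation plotting is not shown.

```python
import math

def distance(a, b):
    """Root-mean-square difference over features observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no common features, so never chosen as a donor
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def impute_nearest(rows):
    """Fill each missing value from the closest row that has that feature.

    Assumes every feature is observed in at least one other row.
    """
    filled = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if value is None:
                donors = [r for k, r in enumerate(rows)
                          if k != i and r[j] is not None]
                new_row[j] = min(donors, key=lambda r: distance(row, r))[j]
        filled.append(new_row)
    return filled

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points,
                       key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels

def observed_cluster_means(rows, labels, k):
    """Per-cluster feature means using only originally observed values."""
    means = []
    for c in range(k):
        cluster = [r for r, lab in zip(rows, labels) if lab == c]
        row_means = []
        for j in range(len(rows[0])):
            seen = [r[j] for r in cluster if r[j] is not None]
            row_means.append(sum(seen) / len(seen) if seen else None)
        means.append(row_means)
    return means

# Synthetic, standardised values for two hypothetical features.
# None marks a test result that was not reported.
rows = [
    [-1.0, -1.1], [-0.9, None], [-1.2, -0.8],
    [1.0, 1.2], [None, 0.9], [1.1, 1.0],
]
filled = impute_nearest(rows)       # imputed copy, used only for clustering
labels = kmeans(filled, k=2)
means = observed_cluster_means(rows, labels, k=2)  # imputed values excluded
print(labels)
print(means)
```

Imputation here only lets incomplete rows take part in the distance calculations. The cluster summaries are computed from `rows`, not `filled`, so no imputed value contributes to the reported cluster characteristics.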
Conclusions
This research further validates the usefulness of exploring alternative data sources in the age of the internet and machine learning. The reddit clusters reinforced the existing notion that people with PCOS can be separated into a primarily metabolic pathology group, a primarily reproductive pathology group and an in-between group with pathology in both domains.