Pub Date: 2023-01-01. DOI: 10.1177/11769351231172609
Karine Pallier, Olivier Prot, Simone Naldi, Francisco Silva, Thierry Denis, Olivier Giry, Sophie Leobon, Elise Deluche, Nicole Tubiana-Mathieu
Background: The Regional Basis of Solid Tumor (RBST), a clinical data warehouse, centralizes information related to cancer patient care in 5 health establishments in 2 French departments.
Purpose: To develop algorithms matching heterogeneous data to "real" patients and "real" tumors with respect to patient identification (PI) and tumor identification (TI).
Methods: The RBST was built as a Neo4j graph database, programmed in Java, with data from ~20 000 patients. The PI algorithm, which uses the Levenshtein distance, was based on the regulatory criteria for identifying a patient. The TI algorithm was built on 6 characteristics: tumor location and laterality, date of diagnosis, histology, and primary and metastatic status. Given the heterogeneous nature and semantics of the collected data, repositories (organ, synonym, and histology repositories) had to be created. The TI algorithm used the Dice coefficient to match tumors.
Results: Patients matched if there was complete agreement of the given name, surname, sex, and day, month, and year of birth. These parameters were assigned weights of 28%, 28%, 21%, and 23% (18% for year, 2.5% for month, and 2.5% for day), respectively. The PI algorithm had a sensitivity of 99.69% (95% confidence interval [CI] [98.89%, 99.96%]) and a specificity of 100% (95% CI [99.72%, 100%]). The TI algorithm used the repositories; weights were assigned to the diagnosis date and associated organ (37.5% and 37.5%, respectively), laterality (16%), histology (5%), and metastatic status (4%). The TI algorithm had a sensitivity of 71% (95% CI [62.68%, 78.25%]) and a specificity of 100% (95% CI [94.31%, 100%]).
Conclusion: The RBST encompasses 2 quality controls: PI and TI. It facilitates the implementation of transversal structuring and assessments of the performance of the provided care.
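The weighted patient-matching idea described in the Methods and Results can be illustrated with a minimal Python sketch. The weights are those reported in the abstract; everything else is an assumption made for illustration: the abstract does not say how the Levenshtein distance enters the score, so the sketch scales each name weight by a normalized edit similarity and awards the remaining weights only on exact agreement. The function names and record fields are hypothetical, and the set-based Dice coefficient is shown only in its generic form, since the abstract does not specify what the TI algorithm compares.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical names, 0.0 for entirely different ones."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Weights (in percent) as reported in the abstract.
WEIGHTS = {
    "given_name": 28.0,
    "surname": 28.0,
    "sex": 21.0,
    "birth_year": 18.0,
    "birth_month": 2.5,
    "birth_day": 2.5,
}


def patient_match_score(p: dict, q: dict) -> float:
    """Hypothetical score in [0, 100]; 100 means complete agreement on all criteria."""
    score = WEIGHTS["given_name"] * name_similarity(p["given_name"], q["given_name"])
    score += WEIGHTS["surname"] * name_similarity(p["surname"], q["surname"])
    for field in ("sex", "birth_year", "birth_month", "birth_day"):
        if p[field] == q[field]:
            score += WEIGHTS[field]
    return score


def dice_coefficient(a: set, b: set) -> float:
    """Generic Dice coefficient of two sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


record = {"given_name": "Marie", "surname": "Dupont", "sex": "F",
          "birth_year": 1950, "birth_month": 4, "birth_day": 12}
variant = dict(record, surname="Dupond")  # one-letter spelling difference
print(patient_match_score(record, record), patient_match_score(record, variant))
```

Under the matching rule stated in the abstract, only a score corresponding to complete agreement would count as a match; a graded score like this one would mainly help flag near-duplicates for review.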
Title: Patient Identification and Tumor Identification Management: Quality Program in a Cancer Multicentric Clinical Data Warehouse. Journal: Cancer Informatics, volume 22. Publication date: 2023-01-01. Open-access PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/f9/25/10.1177_11769351231172609.PMC10201142.pdf
Lung cancer is considered the most common and the deadliest cancer type. It is mainly of 2 types: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for about 85% of cases, while SCLC accounts for only about 14%. Over the last decade, functional genomics has emerged as a revolutionary tool for studying genetics and uncovering changes in gene expression. RNA-Seq has been applied to investigate the rare and novel transcripts that aid in discovering genetic changes that occur in tumours due to different lung cancers. Although RNA-Seq helps to understand and characterise the gene expression involved in lung cancer diagnostics, discovering the biomarkers remains a challenge. Classification models help uncover and classify the biomarkers based on gene expression levels across the different lung cancers. The current research concentrates on computing transcript statistics from gene transcript files with a normalised fold change of genes and identifying quantifiable differences in gene expression levels between the reference genome and lung cancer samples. The collected data were analysed, and machine learning models were developed to classify genes as causing NSCLC, causing SCLC, causing both, or causing neither. An exploratory data analysis was performed to identify the probability distribution and principal features. Because few features were available, all of them were used in predicting the class. To address the imbalance in the dataset, the Near Miss under-sampling algorithm was applied. For classification, the research primarily focused on 4 supervised machine learning algorithms (Logistic Regression, KNN classifier, SVM classifier, and Random Forest classifier); additionally, 2 ensemble algorithms were considered: XGBoost and AdaBoost.
Of these, based on the weighted metrics considered, the Random Forest classifier, with 87% accuracy, was the best-performing algorithm and was therefore used to predict the biomarkers causing NSCLC and SCLC. The imbalance and limited features in the dataset restrict any further improvement in the model's accuracy or precision. In the present study, using the gene expression values (LogFC, P value) as the feature set in the Random Forest classifier, BRAF, KRAS, NRAS, and EGFR are predicted to be possible biomarkers causing NSCLC, and ATF6, ATF3, PGDFA, PGDFD, PGDFC, and PIP5K1C are predicted to be possible biomarkers causing SCLC, based on the transcriptome analysis. After fine-tuning, the model gave a precision of 91.3% and a recall of 91%. Some of the common biomarkers predicted for both NSCLC and SCLC were CDK4, CDK6, BAK1, CDKN1A, and DDB2.
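The abstract compares classifiers on "weighted metrics" without defining them. One common reading, assumed here, is support-weighted averaging: each class's precision and recall is weighted by that class's share of the true labels. The sketch below computes these in plain Python; the function name and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def weighted_precision_recall(y_true, y_pred):
    """Support-weighted precision and recall over all classes present in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    precision = 0.0
    recall = 0.0
    for label, count in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        class_precision = true_pos / predicted if predicted else 0.0
        class_recall = true_pos / count
        weight = count / total  # class share of the true labels
        precision += weight * class_precision
        recall += weight * class_recall
    return precision, recall


# Toy labels for illustration only (not data from the study).
y_true = ["NSCLC", "NSCLC", "SCLC", "both", "neither", "neither"]
y_pred = ["NSCLC", "SCLC", "SCLC", "both", "neither", "NSCLC"]
print(weighted_precision_recall(y_true, y_pred))
```

With this definition, weighted recall equals overall accuracy, which is one reason weighted precision is usually reported alongside it on imbalanced data.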
Title: Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers. Authors: Lavanya C, Pooja S, Abhay H Kashyap, Abdur Rahaman, Swarna Niranjan, Vidya Niranjan. DOI: 10.1177/11769351231167992. Journal: Cancer Informatics, volume 22. Publication date: 2023-01-01. Open-access PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/c4/97/10.1177_11769351231167992.PMC10126698.pdf
Pub Date: 2023-01-01. DOI: 10.1177/11769351231165181
Daniel Brough, Hope Amos, Karl Turley, Jake Murkin
Tumour volume is typically calculated from length and width measurements alone, with width used as a proxy for height in a 1:1 ratio. When tumour growth is tracked over time, important morphological information and measurement accuracy are lost by ignoring height, which we show is a unique variable. Lengths, widths, and heights of 9522 subcutaneous tumours in mice were measured using 3D and thermal imaging. The average height:width ratio was found to be 1:3, showing that using width as a proxy for height overestimates tumour volume. Comparing volumes calculated with and without tumour height to the true volumes of excised tumours showed that the volume formula including height produced volumes 36 times more accurate (based on percentage difference). Monitoring the height:width relationship (prominence) across tumour growth curves indicated that prominence varied and that height could change independently of width. Twelve cell lines were investigated individually; the scale of tumour prominence was cell line-dependent, with relatively less prominent tumours (MC38, BL2, LL/2) and more prominent tumours (RENCA, HCT116) detected. Prominence trends across the growth cycle were also dependent on cell line; prominence was correlated with tumour growth in some cell lines (4T1, CT26, LNCaP), but not others (MC38, TC-1, LL/2). When pooled, invasive cell lines produced tumours that were significantly less prominent at volumes >1200 mm³ compared with non-invasive cell lines (P < .001). Modelling was used to show the impact of the increased accuracy gained by including height in volume calculations on several efficacy study outcomes. Variations in measurement accuracy contribute to experimental variation and irreproducibility of data; we therefore strongly advise researchers to measure height to improve accuracy in tumour studies.
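The overestimation argument can be made concrete with a short calculation. The abstract does not state which volume formula was used, so the sketch below assumes the common ellipsoid approximation V = (π/6) × length × width × height; the function names and example dimensions are hypothetical. Under that assumption, substituting width for height inflates the volume by the factor width/height, which is 3 at the reported average height:width ratio of 1:3.

```python
import math


def volume_with_height(length: float, width: float, height: float) -> float:
    """Ellipsoid approximation (an assumption): V = (pi / 6) * L * W * H."""
    return math.pi / 6.0 * length * width * height


def volume_width_as_height(length: float, width: float) -> float:
    """Two-measurement variant: width stands in for height in a 1:1 ratio."""
    return volume_with_height(length, width, width)


# Hypothetical tumour, in mm, with the reported average height:width ratio of 1:3.
length, width, height = 12.0, 9.0, 3.0
measured = volume_with_height(length, width, height)
proxy = volume_width_as_height(length, width)
print(f"with height: {measured:.1f} mm^3, width as proxy: {proxy:.1f} mm^3, "
      f"ratio: {proxy / measured:.2f}")
```

The ratio depends only on width/height, not on the leading constant, so the same factor applies to other ellipsoid-style formulas that differ only in that constant.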
Title: Trends in Subcutaneous Tumour Height and Impact on Measurement Accuracy. Journal: Cancer Informatics, volume 22. Publication date: 2023-01-01. Open-access PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/2a/51/10.1177_11769351231165181.PMC10126793.pdf
Pub Date: 2023-01-01. DOI: 10.1177/11769351221148592
Fritz F Parl
Different tumor types are characterized by unique histopathological patterns, including distinctive nuclear architectures. I hypothesized that the difference in nuclear appearance is reflected in different nuclear maps of chromosome territories, the discrete regions occupied by individual chromosomes in the interphase nucleus. To test this hypothesis, I used interchromosomal translocations (ITLs) as an analytical tool to map chromosome territories in 11 different tumor types from the TCGA PanCancer database, encompassing 6003 tumors with 5295 ITLs. For each chromosome, I determined the number and percentage of all ITLs for any given tumor type. Chromosomes were ranked according to the frequency and percentage of ITLs per chromosome. The ranking showed similar patterns for all tumor types. Chromosomes 1, 8, 11, 17, and 19 were ranked in the top quarter, accounting for 35.2% of the 5295 ITLs, whereas chromosomes 13, 15, 18, 21, and X were in the bottom quarter, accounting for only 10.5% of ITLs. The correlation between the chromosome ranking in the total group of 6003 tumors and the ranking in individual tumor types was significant, with P values ranging from < .0001 to .0033. Thus, contrary to my hypothesis, different tumor types share a common nuclear map of chromosome territories. Based on the large number of ITLs in 11 different types of malignancy, one can discern a shared pattern of chromosome territories in cancer and propose a probabilistic model with chromosomes 1, 8, 11, 17, and 19 in the center of the nucleus and chromosomes 13, 15, 18, 21, and X at the periphery.
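The ranking-and-correlation step can be sketched as follows. The abstract does not name the correlation statistic, so Spearman's rank correlation is an assumption here, and the simplified formula used is valid only when there are no tied counts. The function names are hypothetical and the ITL counts are toy values invented for illustration, not figures from the study.

```python
def rank_by_count(counts: dict) -> dict:
    """Rank chromosomes by ITL count, 1 = most ITLs (assumes no tied counts)."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {chrom: position for position, chrom in enumerate(ordered, start=1)}


def spearman_rho(rank_a: dict, rank_b: dict) -> float:
    """Spearman correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no-ties formula."""
    n = len(rank_a)
    d_squared = sum((rank_a[chrom] - rank_b[chrom]) ** 2 for chrom in rank_a)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Toy ITL counts per chromosome (invented for illustration).
pooled = {"1": 90, "8": 70, "17": 65, "13": 20, "21": 12, "X": 10}
one_tumor_type = {"1": 30, "8": 33, "17": 25, "13": 9, "21": 4, "X": 5}

rho = spearman_rho(rank_by_count(pooled), rank_by_count(one_tumor_type))
print(f"Spearman rho between pooled and single-type rankings: {rho:.3f}")
```

A rho near 1 for each tumor type against the pooled ranking corresponds to the abstract's finding that the rankings follow a shared pattern.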
Title: Different Tumor Types Share a Common Nuclear Map of Chromosome Territories. Journal: Cancer Informatics, volume 22. Publication date: 2023-01-01. Open-access PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/cd/06/10.1177_11769351221148592.PMC9903037.pdf
Pub Date: 2023-01-01. DOI: 10.1177/11769351231161480
Tania Isabella Aravena, Elizabeth Valdés, Nicolás Ayala, Vívian D'Afonseca
Histone methyltransferases (HMTs) comprise a subclass of epigenetic regulators. Dysregulation of these enzymes results in aberrant epigenetic regulation, commonly observed in various tumor types, including hepatocellular adenocarcinoma (HCC). These epigenetic changes may contribute to tumorigenesis. To predict how histone methyltransferase genes and their genetic alterations (somatic mutations, somatic copy number alterations, and gene expression changes) are involved in hepatocellular adenocarcinoma processes, we performed an integrated computational analysis of genetic alterations in 50 HMT genes present in hepatocellular adenocarcinoma. Biological data were obtained from a public repository comprising 360 samples from patients with hepatocellular carcinoma. From these data, we identified 10 HMT genes (SETDB1, ASH1L, SMYD2, SMYD3, EHMT2, SETD3, PRDM14, PRDM16, KMT2C, and NSD3) with a significant genetic alteration rate (14%) across the 360 samples. Of these 10 HMT genes, KMT2C and ASH1L have the highest mutation rates in HCC samples, 5.6% and 2.8%, respectively. Regarding somatic copy number alterations, ASH1L and SETDB1 are amplified in several samples, while SETD3, PRDM14, and NSD3 showed a high rate of large deletions. Finally, SETDB1, SETD3, PRDM14, and NSD3 could play an important role in the progression of hepatocellular adenocarcinoma, since alterations in these genes lead to decreased patient survival compared with patients in whom these genes are unaltered. Our computational analysis provides new insights into how HMTs are associated with hepatocellular carcinoma and provides a basis for future experimental investigations using HMTs as genetic targets against hepatocellular carcinoma.
Title: A Computational Approach to Predict the Role of Genetic Alterations in Methyltransferase Histones Genes With Implications in Liver Cancer. Journal: Cancer Informatics, volume 22. Publication date: 2023-01-01. Open-access PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/1e/b4/10.1177_11769351231161480.PMC10064455.pdf
Pub Date: 2023-01-01. DOI: 10.1177/11769351231159893
Mehdi Hamaneh, Yi-Kuo Yu
Motivation: The PAM50 signature/method is widely used for intrinsic subtyping of breast cancer samples. However, depending on the number and composition of the samples included in a cohort, the method may assign different subtypes to the same sample. This lack of robustness is mainly due to the fact that PAM50 subtracts a reference profile, which is computed using all samples in the cohort, from each sample before classification. In this paper we propose modifications to PAM50 to develop a simple and robust single-sample classifier, called MPAM50, for intrinsic subtyping of breast cancer. Like PAM50, the modified method uses a nearest centroid approach for classification, but the centroids are computed differently, and the distances to the centroids are determined using an alternative method. Additionally, MPAM50 uses unnormalized expression values for classification and does not subtract a reference profile from the samples. In other words, MPAM50 classifies each sample independently, and so avoids the previously mentioned robustness issue.
Results: A training set was employed to find the new MPAM50 centroids. MPAM50 was then tested on 19 independent datasets (obtained using various expression profiling technologies) containing 9637 samples. Overall, good agreement was observed between the PAM50- and MPAM50-assigned subtypes, with a median accuracy of 0.792, which (we show) is comparable with the median concordance between various implementations of PAM50. Additionally, MPAM50- and PAM50-assigned intrinsic subtypes were found to agree comparably with the reported clinical subtypes. Survival analyses also indicated that MPAM50 preserves the prognostic value of the intrinsic subtypes. These observations demonstrate that MPAM50 can replace PAM50 without loss of performance. In addition, MPAM50 was compared with 2 previously published single-sample classifiers and with 3 alternative modified PAM50 approaches. The results indicated superior performance by MPAM50.
Conclusions: MPAM50 is a robust, simple, and accurate single-sample classifier of intrinsic subtypes of breast cancer.
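The single-sample property described in the Motivation can be illustrated with a generic nearest-centroid classifier. This is a sketch under stated assumptions, not the MPAM50 method: the abstract does not specify how MPAM50 computes its centroids or distances, so Euclidean distance is used as a stand-in, and the gene names, subtype labels, and centroid values are hypothetical.

```python
import math


def nearest_centroid(sample: dict, centroids: dict) -> str:
    """Return the label of the centroid closest to the sample.

    Euclidean distance over the centroid's genes is an assumption; the
    abstract only says distances are determined by an alternative method.
    """
    def distance(expression: dict, centroid: dict) -> float:
        return math.sqrt(sum((expression[g] - centroid[g]) ** 2 for g in centroid))

    return min(centroids, key=lambda label: distance(sample, centroids[label]))


# Hypothetical centroids learned once from a training set.
centroids = {
    "subtype_1": {"gene_a": 1.0, "gene_b": 5.0},
    "subtype_2": {"gene_a": 6.0, "gene_b": 1.0},
}

# The sample is classified from its own expression values only: no cohort-wide
# reference profile is subtracted, so other samples cannot change the result.
sample = {"gene_a": 1.5, "gene_b": 4.0}
print(nearest_centroid(sample, centroids))
```

Because the function takes one sample and fixed centroids as its only inputs, the assigned subtype cannot depend on which other samples happen to be in the cohort, which is the robustness issue the abstract attributes to reference-profile subtraction.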
Title: A Simple Method for Robust and Accurate Intrinsic Subtyping of Breast Cancer. Journal: Cancer Informatics, volume 22. Publication date: 2023-01-01. Open-access PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/38/68/10.1177_11769351231159893.PMC10052604.pdf
Objectives: This study examined prescription NSAIDs as one of the leading predictors of incident depression and assessed the direction of the association among older cancer survivors with osteoarthritis.
Methods: This study used a retrospective cohort (N = 14,992) of older adults with incident cancer (breast, prostate, or colorectal cancer, or non-Hodgkin's lymphoma) and osteoarthritis. We used longitudinal data from the linked Surveillance, Epidemiology, and End Results-Medicare data for the study period from 2006 through 2016, with a 12-month baseline and a 12-month follow-up period. Cumulative NSAIDs days was assessed during the baseline period, and incident depression was assessed during the follow-up period. An eXtreme Gradient Boosting (XGBoost) model was built with 10-fold repeated stratified cross-validation and hyperparameter tuning using the training dataset. The final model selected from the training data demonstrated high performance (Accuracy: 0.82, Recall: 0.75, Precision: 0.75) when applied to the test data. SHapley Additive exPlanations (SHAP) was used to interpret the output from the XGBoost model.
Results: Over 50% of the study cohort had at least one prescription of NSAIDs. Nearly 13% of the cohort were diagnosed with incident depression, with rates ranging from 7.4% for prostate cancer to 17.0% for colorectal cancer. The highest incident depression rate, 25%, was observed at the 90 and 120 cumulative NSAIDs days thresholds. Cumulative NSAIDs days was the sixth leading predictor of incident depression among older adults with OA and cancer. Age, education, care fragmentation, polypharmacy, and zip code-level poverty were the top 5 predictors of incident depression.
Conclusion: Overall, 1 in 8 older adults with cancer and OA were diagnosed with incident depression. Cumulative NSAIDs days was the sixth leading predictor with an overall positive association with incident depression. However, the association was complex and varied by the cumulative NSAIDs days.
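The Methods above report accuracy, recall, and precision on held-out test data. As a minimal sketch of how these three metrics follow from confusion-matrix counts, the snippet below uses a hypothetical helper, `classification_metrics`, and made-up counts; neither comes from the study, and the counts are not meant to reproduce the reported values.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, recall, and precision from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Recall: share of true positives among all actual positives.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of true positives among all predicted positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision}


# Hypothetical counts for illustration only; not data from the study.
metrics = classification_metrics(tp=30, fp=10, fn=10, tn=50)
print(metrics)
```

With an outcome that affects roughly 13% of a cohort, accuracy alone can look high even for a weak model, which is one reason recall and precision are usually reported alongside it.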
{"title":"Prescription Non-Steroidal Anti-Inflammatory Drugs (NSAIDs) and Incidence of Depression Among Older Cancer Survivors With Osteoarthritis: A Machine Learning Analysis.","authors":"Nazneen Fatima Shaikh, Chan Shen, Traci LeMasters, Nilanjana Dwibedi, Amit Ladani, Usha Sambamoorthi","doi":"10.1177/11769351231165161","DOIUrl":"https://doi.org/10.1177/11769351231165161","url":null,"abstract":"<p><strong>Objectives: </strong>This study examined prescription NSAIDs as one of the leading predictors of incident depression and assessed the direction of the association among older cancer survivors with osteoarthritis.</p><p><strong>Methods: </strong>This study used a retrospective cohort (N = 14, 992) of older adults with incident cancer (breast, prostate, colorectal cancers, or non-Hodgkin's lymphoma) and osteoarthritis. We used the longitudinal data from the linked Surveillance, Epidemiology, and End Results -Medicare data for the study period from 2006 through 2016, with a 12-month baseline and 12-month follow-up period. Cumulative NSAIDs days was assessed during the baseline period and incident depression was assessed during the follow-up period. An eXtreme Gradient Boosting (XGBoost) model was built with 10-fold repeated stratified cross-validation and hyperparameter tuning using the training dataset. The final model selected from the training data demonstrated high performance (Accuracy: 0.82, Recall: 0.75, Precision: 0.75) when applied to the test data. SHapley Additive exPlanations (SHAP) was used to interpret the output from the XGBoost model.</p><p><strong>Results: </strong>Over 50% of the study cohort had at least one prescption of NSAIDs. Nearly 13% of the cohort were diagnosed with incident depression, with the rates ranging between 7.4% for prostate cancer and 17.0% for colorectal cancer. The highest incident depression rate of 25% was observed at 90 and 120 cumulative NSAIDs days thresholds. 
Cumulative NSAIDs days was the sixth leading predictor of incident depression among older adults with OA and cancer. Age, education, care fragmentation, polypharmacy, and zip code level poverty were the top 5 predictors of incident depression.</p><p><strong>Conclusion: </strong>Overall, 1 in 8 older adults with cancer and OA were diagnosed with incident depression. Cumulative NSAIDs days was the sixth leading predictor with an overall positive association with incident depression. However, the association was complex and varied by the cumulative NSAIDs days.</p>","PeriodicalId":35418,"journal":{"name":"Cancer Informatics","volume":"22 ","pages":"11769351231165161"},"PeriodicalIF":2.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/25/bc/10.1177_11769351231165161.PMC10123903.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9356662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-01-01 DOI: 10.1177/11769351231190477
Yongjun Liu, Heping Zhang, Yuqing Xu, Yao-Zhong Liu, David P Al-Adra, Matthew M Yeh, Zhengjun Zhang
Hepatocellular carcinoma (HCC) is one of the most fatal cancers in the world. There is an urgent need to understand the molecular background of HCC to facilitate the identification of biomarkers and discover effective therapeutic targets. Published transcriptomic studies have reported a large number of genes that are individually significant for HCC. However, reliable biomarkers remain to be determined. In this study, built on max-linear competing risk factor models, we developed a machine learning analytical framework to analyze transcriptomic data to identify the smallest set of differentially expressed genes (DEGs). By analyzing 9 public whole-transcriptome datasets (containing 1184 HCC samples and 672 nontumor controls), we identified 5 critical DEGs (ie, CCDC107, CXCL12, GIGYF1, GMNN, and IFFO1) between HCC and control samples. The classifiers built on these 5 DEGs reached nearly perfect performance in identification of HCC. The performance of the 5 DEGs was further validated in a US Caucasian cohort that we collected (containing 17 HCC with paired nontumor tissue). The conceptual advance of our work lies in modeling gene-gene interactions and correcting batch effect in the analytic framework. The classifiers built on the 5 DEGs demonstrated clear signature patterns for HCC. The results are interpretable, robust, and reproducible across diverse cohorts/populations with various disease etiologies, indicating the 5 DEGs are intrinsic variables that can describe the overall features of HCC at the genomic level. The analytical framework applied in this study may pave a new way for improving transcriptome profiling analysis of human cancers.
{"title":"Five Critical Gene-Based Biomarkers With Optimal Performance for Hepatocellular Carcinoma.","authors":"Yongjun Liu, Heping Zhang, Yuqing Xu, Yao-Zhong Liu, David P Al-Adra, Matthew M Yeh, Zhengjun Zhang","doi":"10.1177/11769351231190477","DOIUrl":"https://doi.org/10.1177/11769351231190477","url":null,"abstract":"Hepatocellular carcinoma (HCC) is one of the most fatal cancers in the world. There is an urgent need to understand the molecular background of HCC to facilitate the identification of biomarkers and discover effective therapeutic targets. Published transcriptomic studies have reported a large number of genes that are individually significant for HCC. However, reliable biomarkers remain to be determined. In this study, built on max-linear competing risk factor models, we developed a machine learning analytical framework to analyze transcriptomic data to identify the most miniature set of differentially expressed genes (DEGs). By analyzing 9 public whole-transcriptome datasets (containing 1184 HCC samples and 672 nontumor controls), we identified 5 critical differentially expressed genes (DEGs) (ie, CCDC107, CXCL12, GIGYF1, GMNN, and IFFO1) between HCC and control samples. The classifiers built on these 5 DEGs reached nearly perfect performance in identification of HCC. The performance of the 5 DEGs was further validated in a US Caucasian cohort that we collected (containing 17 HCC with paired nontumor tissue). The conceptual advance of our work lies in modeling gene-gene interactions and correcting batch effect in the analytic framework. The classifiers built on the 5 DEGs demonstrated clear signature patterns for HCC. The results are interpretable, robust, and reproducible across diverse cohorts/populations with various disease etiologies, indicating the 5 DEGs are intrinsic variables that can describe the overall features of HCC at the genomic level. 
The analytical framework applied in this study may pave a new way for improving transcriptome profiling analysis of human cancers.","PeriodicalId":35418,"journal":{"name":"Cancer Informatics","volume":"22 ","pages":"11769351231190477"},"PeriodicalIF":2.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/11/97/10.1177_11769351231190477.PMC10413891.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10305114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-01-01 DOI: 10.1177/11769351221144132
Jianglin Feng, Esteban Astiazaran Symonds, Jason H Karnes
Epidemiologic evidence for the association of cholesterol and breast cancer is inconsistent. Several factors may contribute to this inconsistency, including limited sample sizes, confounding effects of antihyperlipidemic treatment, age, and body mass index, and the assumption that the association follows a simple linear function. Here, we aimed to address these factors by combining visualization and quantification in a large-scale contemporary electronic health record database (the All of Us Research Program). We find clear visual and quantitative evidence that breast cancer is strongly, positively, and near-linearly associated with total cholesterol and low-density lipoprotein cholesterol, but not associated with triglycerides. The association of breast cancer with high-density lipoprotein cholesterol was non-linear and age dependent. Standardized odds ratios were 2.12 (95% confidence interval 1.9-2.48), P = 5.6 × 10^-31 for total cholesterol; 1.99 (1.75-2.26), P = 2.6 × 10^-26 for low-density lipoprotein cholesterol; 1.69 (1.3-2.2), P = 9.0 × 10^-5 for high-density lipoprotein cholesterol at age < 56; and 0.65 (0.55-0.78), P = 1.2 × 10^-6 for high-density lipoprotein cholesterol at age ⩾ 56. The inclusion of the lipid levels measured after antihyperlipidemic treatment in the analysis results in erroneous associations. We demonstrate that the use of logistic regression without inspecting risk variable linearity and accounting for confounding effects may lead to inconsistent results.
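The odds ratios above are reported with 95% confidence intervals. As a rough sketch of the standard conversion from a logistic regression coefficient to an odds ratio with a Wald interval, the snippet below assumes a coefficient `beta` (log-odds per 1 standard deviation of a standardized predictor) and its standard error `se`. The function name and the numeric inputs are hypothetical and are not taken from the study, which may have computed its intervals differently.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a log-odds coefficient and its standard error into an
    odds ratio with a Wald confidence interval (z=1.96 gives about 95%)."""
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return odds_ratio, lower, upper


# Hypothetical coefficient and standard error, for illustration only.
or_, lo, hi = odds_ratio_with_ci(beta=0.75, se=0.07)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

An odds ratio of this kind summarizes a single linear term on the log-odds scale, which is why the abstract stresses checking linearity first: a non-linear, age-dependent relationship such as the one described for high-density lipoprotein cholesterol is not well captured by one coefficient.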
{"title":"Visualization and Quantification of the Association Between Breast Cancer and Cholesterol in the All of Us Research Program.","authors":"Jianglin Feng, Esteban Astiazaran Symonds, Jason H Karnes","doi":"10.1177/11769351221144132","DOIUrl":"https://doi.org/10.1177/11769351221144132","url":null,"abstract":"<p><p>Epidemiologic evidence for the association of cholesterol and breast cancer is inconsistent. Several factors may contribute to this inconsistency, including limited sample sizes, confounding effects of antihyperlipidemic treatment, age, and body mass index, and the assumption that the association follows a simple linear function. Here, we aimed to address these factors by combining visualization and quantification a large-scale contemporary electronic health record database (the All of Us Research Program). We find clear visual and quantitative evidence that breast cancer is strongly, positively, and near-linearly associated with total cholesterol and low-density lipoprotein cholesterol, but not associated with triglycerides. The association of breast cancer with high-density lipoprotein cholesterol was non-linear and age dependent. Standardized odds ratios were 2.12 (95% confidence interval 1.9-2.48), <i>P</i> = 5.6 × 10<sup>-31</sup> for total cholesterol; 1.99 (1.75-2.26), <i>P</i> = 2.6 × 10<sup>-26</sup> for low-density lipoprotein cholesterol; 1.69 (1.3-2.2), <i>P</i> = 9.0 × 10<sup>-5</sup> for high-density lipoprotein cholesterol at age < 56; and 0.65 (0.55-0.78), <i>P</i> = 1.2 × 10<sup>-6</sup> for high-density lipoprotein cholesterol at age ⩾ 56. The inclusion of the lipid levels measured after antihyperlipidemic treatment in the analysis results in erroneous associations. 
We demonstrate that the use of the logistic regression without inspecting risk variable linearity and accounting for confounding effects may lead to inconsistent results.</p>","PeriodicalId":35418,"journal":{"name":"Cancer Informatics","volume":"22 ","pages":"11769351221144132"},"PeriodicalIF":2.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/8b/89/10.1177_11769351221144132.PMC9841847.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10550794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-01-01 DOI: 10.1177/11769351221147244
K Chandrashekar, Anagha S Setlur, Adithya Sabhapathi C, Satyam Suresh Raiker, Satyam Singh, Vidya Niranjan
Using a decision support system (DSS) that classifies various cancers provides support to clinicians/researchers to make better decisions that can aid in early cancer diagnosis, thereby reducing chances of incorrect disease diagnosis. Thus, this work aimed at designing a classification model that can predict accurately for 5 different cancer types comprising 20 cancer exomes, using the mutations identified from whole exome cancer analysis. Initially, a basic model was designed using supervised machine learning classification algorithms such as K-nearest neighbor (KNN), support vector machine (SVM), decision tree, naïve Bayes and random forest (RF), among which decision tree and random forest performed better in terms of preliminary model accuracy. However, output predictions were incorrect due to low training scores. Thus, 16 essential features were then selected for model improvement using 2 approaches. All imbalanced datasets were balanced using SMOTE. In the first approach, all features from 20 cancer exome datasets were trained and models were designed using decision tree and random forest. Balanced datasets for the decision tree model showed an accuracy of 77%, while with the RF model, the accuracy improved to 82% where all 5 cancer types were predicted correctly. The area under the curve for the RF model was closer to 1 than that for the decision tree model. In the second approach, 15 datasets were used for training, while 5 were used for testing. However, only 2 cancer types were predicted correctly. To cross validate the RF model, the Matthews correlation coefficient (MCC) test was performed. For method 1, the MCC test and MCC cross validation were found to be 0.7796 and 0.9356 respectively. Likewise, for the second approach, MCC was observed to be 0.9365, corroborating the accuracy of the designed model. The model was successfully deployed using Streamlit as a web application for easy use. This study presents insights for allowing easy cancer classifications.
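The abstract above uses the Matthews correlation coefficient (MCC) to check the random forest model. As a minimal sketch of what MCC measures, the snippet below computes it for the binary case from confusion-matrix counts. The study classifies 5 cancer types and would need the multiclass generalization, so the binary form, the function name, and the counts here are simplifying assumptions for illustration only.

```python
import math


def matthews_corrcoef_binary(tp, fp, fn, tn):
    """Binary Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Conventional fallback when any row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for illustration only; not data from the study.
mcc = matthews_corrcoef_binary(tp=40, fp=5, fn=10, tn=45)
print(round(mcc, 4))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when classes are imbalanced.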
{"title":"Decision Support System and Web-Application Using Supervised Machine Learning Algorithms for Easy Cancer Classifications.","authors":"K Chandrashekar, Anagha S Setlur, Adithya Sabhapathi C, Satyam Suresh Raiker, Satyam Singh, Vidya Niranjan","doi":"10.1177/11769351221147244","DOIUrl":"https://doi.org/10.1177/11769351221147244","url":null,"abstract":"<p><p>Using a decision support system (DSS) that classifies various cancers provides support to the clinicians/researchers to make better decisions that can aid in early cancer diagnosis, thereby reducing chances of incorrect disease diagnosis. Thus, this work aimed at designing a classification model that can predict accurately for 5 different cancer types comprising of 20 cancer exomes, using the mutations identified from whole exome cancer analysis. Initially, a basic model was designed using supervised machine learning classification algorithms such as K-nearest neighbor (KNN), support vector machine (SVM), decision tree, naïve bayes and random forest (RF), among which decision tree and random forest performed better in terms of preliminary model accuracy. However, output predictions were incorrect due to less training scores. Thus, 16 essential features were then selected for model improvement using 2 approaches. All imbalanced datasets were balanced using SMOTE. In the first approach, all features from 20 cancer exome datasets were trained and models were designed using decision tree and random forest. Balanced datasets for decision tree model showed an accuracy of 77%, while with the RF model, the accuracy improved to 82% where all 5 cancer types were predicted correctly. Area under the curve for RF model was closer to 1, than decision tree model. In the second approach, all 15 datasets were trained, while 5 were tested. However, only 2 cancer types were predicted correctly. To cross validate RF model, Matthew's correlation co-efficient (MCC) test was performed. 
For method 1, the MCC test and MCC cross validation was found to be 0.7796 and 0.9356 respectively. Likewise, for second approach, MCC was observed to be 0.9365, corroborating the accuracy of the designed model. The model was successfully deployed using Streamlit as a web application for easy use. This study presents insights for allowing easy cancer classifications.</p>","PeriodicalId":35418,"journal":{"name":"Cancer Informatics","volume":"22 ","pages":"11769351221147244"},"PeriodicalIF":2.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/c2/da/10.1177_11769351221147244.PMC9880585.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10591008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}