Predicting tumor drug response from cancer cell line drug response values for a large number of anti-cancer drugs is a significant challenge in personalized medicine. Predicting patient response to drugs from data obtained from preclinical models is made easier by the availability of diverse information on cell lines and drugs. This paper proposes TCLMF, a model for predicting drug response in tumor samples that is trained on preclinical samples and is based on logistic matrix factorization. The TCLMF model uses gene expression profiles, tissue type information, the chemical structure of drugs and drug sensitivity (IC50) data from cancer cell lines. We train the model on preclinical data from the Genomics of Drug Sensitivity in Cancer (GDSC) dataset and then use it to predict the drug sensitivity of samples from The Cancer Genome Atlas (TCGA) dataset. The TCLMF approach focuses on identifying effective features of cell lines and drugs in order to calculate the probability that a tumor sample is sensitive to a drug. The closest cell line neighbours of each tumor sample are determined using a similarity measure defined between tumor samples and cell lines. The drug response for a new tumor is then calculated by averaging the low-rank features of its neighboring cell lines. We compare TCLMF with previously proposed methods using two databases and two approaches to test the model's performance. In the first approach, we study 12 drugs with enough known clinical drug response data, which were also considered by previous methods. For 7 of the 12 drugs, TCLMF significantly distinguishes between patients who are resistant to the drug and patients who are sensitive to it. In the second approach, the methods are converted to classification models using a threshold, and the results are compared. The results demonstrate that the TCLMF method provides accurate predictions relative to the other algorithms. Finally, we accurately classify tumor tissue type using the latent vectors obtained from TCLMF's logistic matrix factorization process. These findings demonstrate that the TCLMF approach produces effective latent vectors for tumor samples. The source code of the TCLMF method is available at https://github.com/emdadi/TCLMF.
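The neighbour-averaging step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TCLMF implementation: the function names, the toy vectors, the unweighted mean over the k nearest cell lines and the bias-free logistic link are all assumptions made for the example.

```python
import math

def nearest_cell_lines(similarities, k):
    """Indices of the k cell lines most similar to one tumor sample."""
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    return ranked[:k]

def average_latent_vector(cell_line_vectors, neighbour_ids):
    """Unweighted mean of the neighbours' latent vectors (assumed rule)."""
    dim = len(cell_line_vectors[0])
    return [sum(cell_line_vectors[i][d] for i in neighbour_ids) / len(neighbour_ids)
            for d in range(dim)]

def sensitivity_probability(tumor_vector, drug_vector):
    """Logistic link on the inner product; bias terms are omitted here."""
    score = sum(u * v for u, v in zip(tumor_vector, drug_vector))
    return 1.0 / (1.0 + math.exp(-score))

# Toy latent vectors for three cell lines and one drug (made-up values).
cell_line_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
drug_vector = [0.5, -1.0]
tumor_to_cell_line_similarity = [0.9, 0.1, 0.8]

neighbours = nearest_cell_lines(tumor_to_cell_line_similarity, k=2)
tumor_vector = average_latent_vector(cell_line_vectors, neighbours)
probability = sensitivity_probability(tumor_vector, drug_vector)
```

A similarity-weighted mean would be an equally plausible choice for the averaging rule; the unweighted form is used here only because it is the simplest reading of "averaging".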
Gene expression studies using xenograft transplants or co-culture systems, usually with mixed human and mouse cells, have proven valuable for uncovering cellular dynamics during development or in disease models. However, mRNA sequence similarity between species presents a challenge for accurate transcript quantification. To identify optimal strategies for analyzing mixed-species RNA sequencing data, we evaluate both alignment-dependent and alignment-independent methods. Alignment of reads to a pooled reference index is effective, particularly if optimal alignments are used to classify sequencing reads by species; the classified reads are then re-aligned to the individual genomes, yielding [Formula: see text] accuracy across a range of species ratios. Alignment-independent methods, such as convolutional neural networks, which extract the conserved sequence patterns of the two species, classify RNA sequencing reads with over 85% accuracy. Importantly, both methods perform well with different ratios of human and mouse reads. While alignment-independent strategies successfully partitioned reads by species, the more traditional approach of mixed-genome alignment followed by optimized separation of reads proved more successful, with lower error rates.
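The idea of classifying reads by their best alignment against a pooled index can be sketched as follows. This is a hedged illustration only: the input format (a list of hypothetical `(species, score)` pairs per read), the tie-handling rule and the function names are assumptions, and no real aligner output is parsed.

```python
from collections import defaultdict

def classify_read(alignments):
    """Assign a read to the species holding its single best alignment score.

    `alignments` is a list of (species, score) pairs from a pooled index.
    Ties between species are reported as "ambiguous" (an assumed rule).
    """
    best_by_species = {}
    for species, score in alignments:
        if score > best_by_species.get(species, float("-inf")):
            best_by_species[species] = score
    if not best_by_species:
        return "unaligned"
    top_score = max(best_by_species.values())
    winners = [s for s, score in best_by_species.items() if score == top_score]
    return winners[0] if len(winners) == 1 else "ambiguous"

def partition_reads(reads):
    """Group read identifiers by assigned species before re-alignment."""
    bins = defaultdict(list)
    for read_id, alignments in reads.items():
        bins[classify_read(alignments)].append(read_id)
    return dict(bins)

# Toy alignment scores (made-up values).
reads = {
    "read1": [("human", 98), ("mouse", 90)],
    "read2": [("mouse", 100)],
    "read3": [("human", 95), ("mouse", 95)],
    "read4": [],
}
bins = partition_reads(reads)
```

In this sketch, each species bin would then be passed to a separate alignment against that species' own genome.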
Protein succinylation participates in the regulation of a variety of cellular processes. Identifying modified substrates and their precise sites is the basis for understanding the molecular mechanism and regulation of succinylation. In this work, we selected five feature encodings, CKSAAP, ACF, BLOSUM62, AAindex, and one-hot, according to their performance in succinylation site prediction. Then, LSTM and CNN networks were used to construct four models: LSTM-CNN, CNN-LSTM, LSTM, and CNN. Each of the five selected features was input into each of these four models for training in order to compare the models. Based on the performance of each model, the best one was chosen to construct a hybrid model, DeepSucc, composed of five sub-modules for integrating heterogeneous information. Under 10-fold cross-validation, DeepSucc achieves 86.26% accuracy, 84.94% specificity, 87.57% sensitivity, 0.9406 AUC, and 0.7254 MCC. When compared with other prediction tools on an independent test set, DeepSucc outperformed them in sensitivity and MCC. The datasets and source code can be accessed at https://github.com/1835174863zd/DeepSucc.
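As an illustration of one of the encodings named above, the following sketch computes CKSAAP (composition of k-spaced amino acid pairs) features for a single gap size. It is not taken from the DeepSucc code: the function name, the handling of non-standard residues (such pairs are skipped) and the normalisation by the number of pair positions are assumptions for the example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def cksaap(sequence, gap):
    """Frequencies of residue pairs separated by `gap` positions.

    Pairs containing a character outside the 20 standard residues
    (for example padding symbols) are skipped (an assumed rule).
    """
    positions = len(sequence) - gap - 1
    if positions <= 0:
        return [0.0] * len(PAIRS)
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(positions):
        pair = sequence[i] + sequence[i + gap + 1]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] / positions for p in PAIRS]

# Toy peptide fragment: with gap=1 the pairs are A_A and C_C.
features = cksaap("ACAC", gap=1)
```

A full feature vector would typically concatenate the outputs for several gap sizes; the range of gaps used is a modelling choice not stated in the text above.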
Tensor Robust Principal Component Analysis (TRPCA) has achieved promising results in the analysis of genomics data. However, the TRPCA model under the existing tensor singular value decomposition ([Formula: see text]-SVD) framework insufficiently extracts the potential low-rank structure of the data, resulting in suboptimal restored components. Moreover, the tensor nuclear norm (TNN) defined on the basis of [Formula: see text]-SVD applies the same standard to all singular values. Because TNN ignores the differences between singular values, the main information, which should be well preserved, is not retained. To preserve the heterogeneous structure in the low-rank information, we propose a novel TNN and extend it to the TRPCA model. Because the potential low-rank space may contain important information, we learn the low-rank structural information from the core tensor. The singular value space contains the association information between genes and cancers. The [Formula: see text]-shrinkage generalized threshold function is used to preserve the low-rank properties of larger singular values. The optimization problem is solved by the alternating direction method of multipliers (ADMM). Clustering and feature selection experiments are performed on the TCGA data set. The experimental results show that the proposed model is more promising than other state-of-the-art tensor decomposition methods.
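To make the point about uniform treatment of singular values concrete, the sketch below places plain soft-thresholding, the standard proximal step for a nuclear norm, next to a reweighted variant that shrinks large singular values less. The reweighted rule is a generic stand-in chosen for illustration; it is not the generalized threshold function of the proposed model, and the function names, `tau` and `eps` are assumptions. The code operates on a plain list of singular values rather than on a tensor.

```python
def soft_threshold(singular_values, tau):
    """Subtract the same threshold tau from every singular value."""
    return [max(s - tau, 0.0) for s in singular_values]

def weighted_soft_threshold(singular_values, tau, eps=1e-8):
    """Shrink each value by tau / (value + eps), so large values lose less.

    Illustrative reweighting only; not the threshold function of the paper.
    """
    return [max(s - tau / (s + eps), 0.0) for s in singular_values]

# Toy singular values (made-up numbers).
singular_values = [5.0, 2.0, 0.5]
uniform = soft_threshold(singular_values, tau=1.0)
weighted = weighted_soft_threshold(singular_values, tau=1.0)
```

With these toy values the uniform rule removes one unit from the largest singular value, whereas the reweighted rule removes only about 0.2, which is the kind of behaviour the abstract describes as preserving the larger singular values.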
With the rapid development of deep sequencing technologies, a large amount of high-throughput data has become available for studying carcinogenic mechanisms at the molecular level. It is widely accepted that the development and progression of cancer are regulated by modules/pathways rather than individual genes, and the identification of cancer-related active modules has received extensive attention. In this paper, we put forward an identification method, ModFinder, that integrates biological networks and gene expression profiles. More concretely, a gene scoring function is devised using a regression model with a [Formula: see text]-step random walk kernel, and the genes are ranked according to both their active scores and their degrees in the PPI network. Then a greedy algorithm, NSEA, is introduced to find an active module with a high score and strong connectivity. Experiments were performed on both simulated data and real biological data, i.e. breast cancer and cervical cancer. Compared with the previous methods SigMod, LEAN and RegMod, ModFinder shows competitive performance. It can successfully identify a well-connected module that contains a large proportion of cancer-related genes, including some well-known oncogenes or tumor suppressors enriched in cancer-related pathways.
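A greedy search for a connected, high-scoring module can be sketched as below. This is a generic greedy expansion written for illustration and should not be read as the NSEA algorithm: the module score (sum of gene scores divided by the square root of module size), the seed choice, the stopping rule and all names are assumptions.

```python
import math

def module_score(genes, gene_scores):
    """Size-normalised aggregate score of a gene set (assumed scoring rule)."""
    return sum(gene_scores[g] for g in genes) / math.sqrt(len(genes))

def greedy_module(network, gene_scores, seed):
    """Grow a connected module from `seed`, adding the best neighbour each step.

    `network` maps each gene to the set of its interaction partners.
    Expansion stops when no neighbouring gene improves the module score.
    """
    module = {seed}
    best = module_score(module, gene_scores)
    while True:
        frontier = {n for g in module for n in network[g]} - module
        candidate, candidate_score = None, best
        for gene in sorted(frontier):  # sorted for deterministic tie-breaking
            score = module_score(module | {gene}, gene_scores)
            if score > candidate_score:
                candidate, candidate_score = gene, score
        if candidate is None:
            return module, best
        module.add(candidate)
        best = candidate_score

# Toy PPI network and gene scores (made-up values).
network = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B"}}
gene_scores = {"A": 3.0, "B": 2.5, "C": -1.0, "D": 2.0}
module, score = greedy_module(network, gene_scores, seed="A")
```

Because candidates are drawn only from neighbours of the current module, the result is connected by construction, which is the property the text refers to as strong connectivity in its simplest form.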
In this paper, we propose a new Bayesian approach for QTL mapping of family data. The main purpose is to model a phenotype as a function of QTL effects. The model accounts for the detailed familial dependence and does not rely on random effects. It combines the probability of Mendelian inheritance of the parents' genotypes with the correlation between flanking markers and QTLs. This is an advance over models that use only Mendelian segregation or only the correlation between markers and QTLs to estimate transmission probabilities. We use the Bayesian approach to estimate the number of QTLs, their locations and their additive and dominance effects. We compare the performance of the proposed method with variance component and LASSO models using simulated and GAW17 data sets. Under the tested conditions, the proposed method outperforms the other methods in estimating the number of QTLs, the accuracy of the QTL positions and the estimates of their effects. The results of the application of the proposed method to data sets exceeded all of our expectations.
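The Mendelian component of the transmission probabilities can be illustrated with a small sketch for a single biallelic QTL. It covers only Mendelian segregation and a conventional additive/dominance coding; the flanking-marker information and the Bayesian estimation described above are not modelled, and all function names and the genotype coding (number of copies of allele A) are assumptions.

```python
def transmission_probability(genotype):
    """Probability that a parent carrying `genotype` copies of A transmits A."""
    return genotype / 2.0

def offspring_genotype_distribution(mother, father):
    """P(child has 0, 1 or 2 copies of A) under Mendelian segregation."""
    p_m = transmission_probability(mother)
    p_f = transmission_probability(father)
    return {
        0: (1 - p_m) * (1 - p_f),
        1: p_m * (1 - p_f) + (1 - p_m) * p_f,
        2: p_m * p_f,
    }

def genotypic_value(genotype, additive, dominance):
    """Conventional coding: aa -> -a, Aa -> d, AA -> +a."""
    return {0: -additive, 1: dominance, 2: additive}[genotype]

def expected_qtl_effect(mother, father, additive, dominance):
    """Expected genotypic value of a child given the parents' genotypes."""
    dist = offspring_genotype_distribution(mother, father)
    return sum(p * genotypic_value(g, additive, dominance)
               for g, p in dist.items())

# Two heterozygous parents give the familiar 1:2:1 ratio.
child_distribution = offspring_genotype_distribution(1, 1)
effect = expected_qtl_effect(1, 1, additive=2.0, dominance=1.0)
```

In a model that also uses flanking markers, these segregation probabilities would be combined with marker-derived information rather than used on their own.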