Background: Copy number alterations (CNAs) are an important component of genetic variation. Such alterations are associated with certain types of cancer, including those of the pancreas, colon, and breast, among others. CNAs have been used as biomarkers for cancer prognosis in multiple studies, but few works report on the relation between CNAs and disease progression. Moreover, most studies do not consider the following two important issues. (I) Identifying CNAs in genes that are responsible for expression regulation is fundamental to defining the genetic events that lead to malignant transformation and progression. (II) Most real domains are best described by structured data, in which instances of multiple types are related to each other in complex ways.
Results: Our main interest is to check whether inference of colorectal cancer (CRC) progression benefits from considering both (I) the expression levels of genes with CNAs, and (II) relationships (i.e. dissimilarities) between patients due to differences in the expression levels of the altered genes. We first evaluate the accuracy of a state-of-the-art inference method (support vector machine) when subjects are represented only by the available attribute values (i.e. gene expression levels). We then check whether inference accuracy improves when the information mentioned above is exploited explicitly. Our results suggest that CRC progression inference improves when the combined data (i.e. CNA and expression level) and the considered dissimilarity measures are used.
Conclusions: With our approach, classification is intuitively appealing and can be carried out conveniently in the resulting dissimilarity spaces. Different public datasets from the Gene Expression Omnibus (GEO) were used to validate the results.
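
To illustrate what classification in a dissimilarity space can look like, the following is a minimal sketch, not the method used in this work. It assumes that each patient is an expression-level vector over the altered genes, that Euclidean distance serves as the dissimilarity measure, and that the training patients act as prototypes. A nearest-centroid rule stands in for the support vector machine so the example needs no external library; the toy values, the stage labels "early" and "late", and all function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def to_dissimilarity_space(samples, prototypes, dissimilarity=euclidean):
    """Re-describe each sample by its dissimilarities to the prototypes."""
    return [[dissimilarity(s, p) for p in prototypes] for s in samples]


def nearest_centroid_fit(vectors, labels):
    """Compute one mean vector per class (stand-in for the SVM)."""
    centroids = {}
    for label in set(labels):
        members = [v for v, l in zip(vectors, labels) if l == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def nearest_centroid_predict(centroids, vector):
    """Return the label of the class centroid closest to the vector."""
    return min(centroids, key=lambda label: euclidean(centroids[label], vector))


# Hypothetical toy data: rows are patients, columns are altered genes.
train = [[1.0, 1.2, 0.9], [1.1, 0.8, 1.0], [3.0, 3.1, 2.9], [2.8, 3.2, 3.0]]
labels = ["early", "early", "late", "late"]

# Assumption: every training patient is used as a prototype.
prototypes = train
train_d = to_dissimilarity_space(train, prototypes)
centroids = nearest_centroid_fit(train_d, labels)

# A new patient is mapped into the same space before classification.
query_d = to_dissimilarity_space([[2.9, 3.0, 3.1]], prototypes)[0]
prediction = nearest_centroid_predict(centroids, query_d)
print(prediction)
```

In this representation each patient is described by as many features as there are prototypes, so any vector-based classifier, including a support vector machine, could be trained on `train_d` in place of the nearest-centroid rule.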