{"title":"SEEDCCA: An Integrated R-Package for Canonical Correlation Analysis and Partial Least Squares","authors":"Boyoung Kim, Yunju Im, Keun Yoo Jae","doi":"10.32614/rj-2021-026","DOIUrl":null,"url":null,"abstract":"Canonical correlation analysis (CCA) has a long history as an explanatory statistical method in high-dimensional data analysis and has been successfully applied in many science fields such as chemomtrics, pattern recognition, genomic sequence analysis and so on. The so-called seedCCA is a newly developed R package, and it implements not only the standard and seeded CCA but also partial least squares. The package enables us to fit CCA to large-p and small-n data. The paper provides a complete guide. Also, the seeded CCA application results are compared with the regularized CCA in the existing R package. It is believed that the package along with the paper will contribute to highdimensional data analysis in various science field practitioners and that the statistical methodologies in multivariate analysis become more fruitful. Introduction Explanatory studies are important to identify patterns and special structure in data prior to develop a specific model. When a study between two sets of a p-dimensional random variables X (X ∈ Rp) and a r-dimensional random variable Y (Y ∈ Rr), are of primary interest, one of the popular explanatory statistical methods would be canonical correlation analysis (CCA; Hotelling (1936)). The main goal of CCA is the dimension reduction of two sets of variables by measuring an association between the two sets. For this, pairs of linear combinations of variables are constructed by maximizing the Pearson correlation. The CCA has successful application in many science fields such as chemomtrics, pattern recognition, genomic sequence analysis and so on. In Lee and Yoo (2014) it is shown that the CCA can be used as a dimension reduction tool for high-dimensional data, but also it is connected to least square estimator. 
Therefore, the CCA is not only explanatory and dimension reduction method but also can be utilized as alternative of least square estimation. If max(p, r) is bigger than or equal to the sample size, n, usual CCA application is not plausible due to no incapability of inverting sample covariance matrices. To overcome this, a regularized CCA is developed by Leurgans et al. (1993), whose idea was firstly suggested in Vinod (1976). In practice, the CCA package by González et al. (2008) can implement a version of the regularized CCA. To make the sample covariance matrices saying Σ̂x and Σ̂y, invertible, in González et al. (2008), they are replaced with Σ̂ λ1 x = Σ̂x + λ1Ip and Σ̂ λ2 y = Σ̂y + λ1Ir. The optimal values of λ1 and λ2 are chosen by maximizing a cross-validation score throughout the two-dimensional grid search. Although it is discussed that a relatively small grid of reasonable values for λ1 and λ2 can lesson intensive computing in González et al. (2008), it is still time-consuming as observed in later sections. Additionally, fast regularized CCA and robust CCA via projection-pursuit are recently developed in Cruz-Cano (2012) and Alfons et al. (2016), respectively. Another version of CCA to handle max(p, r) > n is the so-called seeded canonical correlation analysis proposed by Im et al. (2014) Since the seeded CCA does not require any regularization procedure, which is computationally intensive, its implementation to larger data is quite fast. The seeded CCA requires two steps. In the initial step, a set of variables bigger than n is initially reduced based on iterative projections. In next step, the standard CCA is applied to two sets of variables acquired from the initial step to finalize CCA of data. Another advantage is that the procedure of the seeded CCA has a close relation with partial least square, which is one of the popular statistical methods for large p-small n data, so the seed CCA can yield the PLS estimates. 
The seedCCA package is recently developed mainly to implement the seeded CCA. However, the package can fit a collection of the statistical methodologies, which are standard canonical correlation and partial least squares with uni/multi-dimensional responses including the seeded CCA. The package has been already uploaded to CRAN (https://cran.r-project.org/web/packages/ seedCCA/index.html). The main goal of the paper is to introduce and illustrate the seedCCA package. For this, three real data are fitted by the standard CCA, the seeded CCA and partial least square, and two of the three data are available in the package. One of them has been analyzed in González et al. (2008). So, the The R Journal Vol. XX/YY, AAAA 20ZZ ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLE 2 implementation results by the seeded and regularized CCA are closely compared. The organization of the paper is as follows. The collection of three methodologies is discussed in Section 2. The implementation of seedCCA is illustrated and compared with CCA in Section 3. In Section 4, we summarize the work. We will use the following notations throughout the rest of the paper. A p-dimensional random variable X will be denoted as X ∈ Rp. So, X ∈ Rp means a random variable, although there is no specific mention. For X ∈ Rp and Y ∈ Rr, we define that cov(X) = Σx, cov(Y) = Σy, cov(X, Y) = Σxy and cov(Y, X) = Σyx. And, it is assumed that Σx and Σy are positive-definite. Collection of implemented methodologies in seedCCA Canonical correlation analysis Suppose the two sets of variable X ∈ Rp and Y ∈ Rr and consider their linear combinations of U = aTX and V = bTY. Then we have var(U) = aΣxa, var(V) = bΣyb, and cov(U, V) = aΣxyb, where a ∈ Rp×1 and b ∈ Rr×1. Then Pearson-correlation between U and V is as follows: cor(U, V) = aΣxyb √ aTΣxa √ bTΣyb . (1) We seek to find a and b to maximize cor(U, V) with satisfying the following criteria. 1. 
The first canonical variate pair (U1 = a1 X, V1 = b T 1 Y) is obtained from maximizing (1). 2. The second canonical variate pair (U2 = a2 X, V2 = b T 2 Y) is constructed from the maximization of (1) with restriction that var(U2) = var(V2) = 1 and (U1, V1) and (U2, V2) are uncorrelated. 3. At the k step, the kth canonical variate pair (Uk = ak X, Vk = b T k Y) is obtained from the maximization of (1) with restriction that var(Uk) = var(Vk) = 1 and (Uk, Vk) are uncorrelated with the previous (k− 1) canonical variate pairs. 4. Repeat Steps 1 to 3 until k becomes q (= min(p, r)). 5. Select the first d pairs of (Uk, Vk) to represent the relationship between X and Y. Under this criteria, the pairs (ai, bi) are constructed as follows: ai = Σ−1/2 x ψi and bi = Σ −1/2 y φi for i = 1, . . . , q, where (ψ1, ..., ψq) and (φ1, ..., φq) are, respectively, the q eigenvectors of Σ−1/2 x ΣxyΣ −1 y ΣyxΣ −1/2 y and Σ−1/2 y ΣyxΣ −1 x ΣxyΣ −1/2 x with the corresponding common ordered-eigenvalues of ρ∗2 1 ≥ · · · ≥ ρ∗2 q ≥ 0. Then matrices of Mx = (a1, ..., ad) and My = (b1, ..., bd) are called canonical coefficient matrices for d = 1, ..., q. Also, Mx X and My Y are called canonical variates. In sample, the population quantities are replaced with their usual moment estimators. For more details regarding this standard CCA, readers may refer Johnson and Wichern (2007). Seeded canonical correlation analysis Since the standard CCA application requires the inversion of Σ̂x and Σ̂y in practice, it is not plausible for high-dimensional data with max(p, r) > n. In Im et al. (2014) a seeded canonical correlation analysis approach is proposed to overcome this deficit. The seeded CCA is a two step procedure, consisting of initialized and finalized steps. In the initialized step, the original two sets of variables are reduced to m-dimensional pairs without loss of information on the CCA application. In the initialized step, it is essential to force m << n. 
In the finalized step, the standard CCA is implemented to the initially-reduced pairs for the repairing and orthonormality. More detailed discussion on the seeded CCA is as follows in next subsections. Development Define a notation of S(M) as the subspace spanned by the columns of M ∈ Rp×r . Lee and Yoo (2014) show the following relation: S(Mx) = S(Σ−1 x Σxy) and S(My) = S(Σ−1 y Σyx). (2) The relation in (2) directly indicates that Mx and My form basis matrices of S(Σ−1 x Σxy) and S(Σ−1 y Σyx) and that Mx and My can be restored from Σ−1 x Σxy and Σ −1 y Σyx. The R Journal Vol. XX/YY, AAAA 20ZZ ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLE 3 Now, we define the following two matrices: Rx,u1 ∈ Rp×ru1 = (Σxy, ΣxΣxy, . . . , Σu1−1 x Σxy) and Ry,u2 ∈ Rr×pu2 = (Σyx, ΣyΣyx, . . . , Σu2−1 y Σyx). (3) In Rx,u1 and Ry,u2 , the numbers of u1 and u2 are called termination indexes, and they decide the number of projections of Σxy and Σyx onto Σx and Σy, respectively. Also define that Mx,u1 ∈ R p×r = Rx,u1 (R T x,u1 ΣxRx,u1 ) Rx,u1 Σxy and My,u2 ∈ R r×p = Ry,u2 (R T y,u2 ΣyRy,u2 ) Ry,u2 Σyx. (4) In Cook et al. (2007) it is shown that S(Mx,u1 ) = S(Σ −1 x Σxy) and S(My,u2 ) = S(Σ −1 y Σyx) in (4), and hence Mx,u1 and M 0 y,u2 can be used to infer Mx and My, respectively. One clear advantage to use M 0 x,u1 and My,u2 is no need of the inversion of Σx and Σy. Practically, it is important to select proper values for the termination indexes u1 and u2. For this define that ∆x,u1 = M 0 x,u1+1 −M 0 x,u1 and ∆y,u2 = M 0 y,u2+1 −M 0 y,u2 . Finally, the following measure for increment of u1 and u2 is defined: nFx,u1 = ntrace(∆ T x,u1 Σx∆x,u1 ) and nFy,u2 = ntrace(∆ T y,u2 Σy∆y,u2 ). Then, a proper value of u is set to have little changes in nFx,u1 and nFx,u1+1 and in nFy,u2 and nFy,u2+1. It is not necessary that the selected u1 and u2 for Mx,u1 and M 0 y,u2 are common. Next, the original two sets of variables of X and Y are replaced with M0 T x,u1 X ∈ R r and M0 T y,u2 Y ∈ R p. 
This reduction of X and Y does not cause any loss of information on CCA in sense that S(Mx,u1 ) = S(Mx) and S(My,u2 ) = S(My), and it is called initialized CCA. The initialized CCA has the following two cases. case 1: Suppose that min(p, r) = r << n. Then, the original X alone is replaced with M0 T x,u1 X and the original Y is kept. case 2: If min(p, r) = r is not fairly smaller than n, Σxy and Σyx","PeriodicalId":20974,"journal":{"name":"R J.","volume":"44 1","pages":"7"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"R J.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.32614/rj-2021-026","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Canonical correlation analysis (CCA) has a long history as an exploratory statistical method in high-dimensional data analysis and has been successfully applied in many scientific fields such as chemometrics, pattern recognition, and genomic sequence analysis. The seedCCA package is a newly developed R package that implements not only standard and seeded CCA but also partial least squares. The package enables us to fit CCA to large-p, small-n data. This paper provides a complete guide to the package, and the seeded CCA results are compared with those of the regularized CCA in an existing R package. We believe that the package, along with this paper, will help practitioners in various scientific fields analyze high-dimensional data and will make the statistical methodologies of multivariate analysis more fruitful.

Introduction

Exploratory studies are important for identifying patterns and special structure in data prior to developing a specific model. When the relation between two sets of random variables, a p-dimensional X (X ∈ R^p) and an r-dimensional Y (Y ∈ R^r), is of primary interest, one of the most popular exploratory statistical methods is canonical correlation analysis (CCA; Hotelling (1936)). The main goal of CCA is dimension reduction of the two sets of variables through a measure of association between them. For this, pairs of linear combinations of the variables are constructed by maximizing their Pearson correlation. CCA has been applied successfully in many scientific fields such as chemometrics, pattern recognition, and genomic sequence analysis. Lee and Yoo (2014) show that CCA can be used as a dimension reduction tool for high-dimensional data and that it is connected to the least squares estimator. Therefore, CCA is not only an exploratory dimension reduction method but can also be utilized as an alternative to least squares estimation.
If max(p, r) is greater than or equal to the sample size n, the usual CCA cannot be applied because the sample covariance matrices are not invertible. To overcome this, a regularized CCA was developed by Leurgans et al. (1993), whose idea was first suggested by Vinod (1976). In practice, the CCA package by González et al. (2008) implements a version of the regularized CCA. To make the sample covariance matrices Σ̂x and Σ̂y invertible, González et al. (2008) replace them with Σ̂x^{λ1} = Σ̂x + λ1 I_p and Σ̂y^{λ2} = Σ̂y + λ2 I_r. The optimal values of λ1 and λ2 are chosen by maximizing a cross-validation score over a two-dimensional grid search. Although González et al. (2008) note that a relatively small grid of reasonable values for λ1 and λ2 can lessen the intensive computation, the search is still time-consuming, as observed in later sections. Additionally, a fast regularized CCA and a robust CCA via projection pursuit were recently developed by Cruz-Cano (2012) and Alfons et al. (2016), respectively. Another version of CCA that handles max(p, r) > n is the so-called seeded canonical correlation analysis proposed by Im et al. (2014). Since the seeded CCA does not require any computationally intensive regularization procedure, its implementation for larger data is quite fast. The seeded CCA consists of two steps. In the initial step, any set of variables bigger than n is reduced based on iterative projections. In the next step, the standard CCA is applied to the two sets of variables acquired from the initial step to finalize the CCA of the data. Another advantage is that the seeded CCA procedure is closely related to partial least squares (PLS), one of the popular statistical methods for large-p, small-n data, so the seeded CCA can yield the PLS estimates. The seedCCA package was recently developed mainly to implement the seeded CCA.
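The regularization described above amounts to adding ridge terms λ1 I_p and λ2 I_r to the sample covariance matrices before the usual CCA eigendecomposition. A minimal NumPy sketch of that idea follows; the actual implementations are R packages, and the function name and its interface here are illustrative, not the CCA package's API:

```python
import numpy as np

def rcca_first_pair(X, Y, lam1, lam2):
    """First canonical pair after ridge-regularizing the covariances.

    X: (n, p) data matrix, Y: (n, r) data matrix. Returns (a, b, rho).
    """
    n = X.shape[0]
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Sx = Xc.T @ Xc / (n - 1) + lam1 * np.eye(X.shape[1])   # Sx + lam1*I_p
    Sy = Yc.T @ Yc / (n - 1) + lam2 * np.eye(Y.shape[1])   # Sy + lam2*I_r
    Sxy = Xc.T @ Yc / (n - 1)
    # Inverse square root of the regularized Sx via its eigendecomposition
    ex, Vx = np.linalg.eigh(Sx)
    Sx_isqrt = Vx @ np.diag(ex ** -0.5) @ Vx.T
    # Largest eigenpair of Sx^{-1/2} Sxy Sy^{-1} Syx Sx^{-1/2}
    M = Sx_isqrt @ Sxy @ np.linalg.solve(Sy, Sxy.T) @ Sx_isqrt
    evals, evecs = np.linalg.eigh(M)
    rho = np.sqrt(max(evals[-1], 0.0))      # largest canonical correlation
    a = Sx_isqrt @ evecs[:, -1]             # X-side coefficient vector
    b = np.linalg.solve(Sy, Sxy.T @ a)
    b /= np.sqrt(b @ Sy @ b)                # scale so var(V) = 1
    return a, b, rho
```

Even when p ≥ n, the ridge terms make the regularized covariance matrices invertible, which is exactly what the grid search over (λ1, λ2) exploits.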
However, the package fits a collection of statistical methodologies: standard canonical correlation analysis, partial least squares with uni- and multi-dimensional responses, and the seeded CCA. The package is available on CRAN (https://cran.r-project.org/web/packages/seedCCA/index.html). The main goal of this paper is to introduce and illustrate the seedCCA package. For this, three real data sets are fitted by the standard CCA, the seeded CCA, and partial least squares; two of the three data sets are available in the package. One of them was analyzed in González et al. (2008), so the implementation results of the seeded and regularized CCA are closely compared. The organization of the paper is as follows. The collection of three methodologies is discussed in Section 2. The implementation of seedCCA is illustrated and compared with CCA in Section 3. In Section 4, we summarize the work. We use the following notation throughout the rest of the paper. A p-dimensional random variable X is denoted as X ∈ R^p; thus X ∈ R^p means a random variable even when there is no specific mention. For X ∈ R^p and Y ∈ R^r, we define cov(X) = Σx, cov(Y) = Σy, cov(X, Y) = Σxy, and cov(Y, X) = Σyx, and we assume that Σx and Σy are positive definite.

Collection of implemented methodologies in seedCCA

Canonical correlation analysis

Suppose two sets of variables X ∈ R^p and Y ∈ R^r, and consider their linear combinations U = a^T X and V = b^T Y, where a ∈ R^{p×1} and b ∈ R^{r×1}. Then we have var(U) = a^T Σx a, var(V) = b^T Σy b, and cov(U, V) = a^T Σxy b. The Pearson correlation between U and V is

cor(U, V) = a^T Σxy b / (√(a^T Σx a) √(b^T Σy b)).  (1)

We seek a and b that maximize cor(U, V) subject to the following criteria.

1. The first canonical variate pair (U1 = a1^T X, V1 = b1^T Y) is obtained by maximizing (1).
2.
The second canonical variate pair (U2 = a2^T X, V2 = b2^T Y) is constructed by maximizing (1) under the restriction that var(U2) = var(V2) = 1 and (U2, V2) is uncorrelated with (U1, V1).
3. At the kth step, the kth canonical variate pair (Uk = ak^T X, Vk = bk^T Y) is obtained by maximizing (1) under the restriction that var(Uk) = var(Vk) = 1 and (Uk, Vk) is uncorrelated with the previous (k − 1) canonical variate pairs.
4. Repeat Steps 1 to 3 until k reaches q (= min(p, r)).
5. Select the first d pairs (Uk, Vk) to represent the relationship between X and Y.

Under these criteria, the pairs (ai, bi) are constructed as ai = Σx^{-1/2} ψi and bi = Σy^{-1/2} φi for i = 1, ..., q, where (ψ1, ..., ψq) and (φ1, ..., φq) are, respectively, the q eigenvectors of Σx^{-1/2} Σxy Σy^{-1} Σyx Σx^{-1/2} and Σy^{-1/2} Σyx Σx^{-1} Σxy Σy^{-1/2}, with common ordered eigenvalues ρ*1² ≥ ··· ≥ ρ*q² ≥ 0. The matrices Mx = (a1, ..., ad) and My = (b1, ..., bd) are called the canonical coefficient matrices for d = 1, ..., q, and Mx^T X and My^T Y are called the canonical variates. In the sample, the population quantities are replaced with their usual moment estimators. For more details on standard CCA, readers may refer to Johnson and Wichern (2007).

Seeded canonical correlation analysis

Since standard CCA requires the inversion of Σ̂x and Σ̂y in practice, it is not feasible for high-dimensional data with max(p, r) > n. Im et al. (2014) propose a seeded canonical correlation analysis to overcome this deficit. The seeded CCA is a two-step procedure consisting of an initialized step and a finalized step. In the initialized step, the original two sets of variables are reduced to m-dimensional pairs without loss of information for the CCA; here it is essential to force m << n. In the finalized step, the standard CCA is applied to the initially reduced pairs to restore proper pairing and orthonormality.
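The construction of the standard CCA pairs (ai, bi) described above can be computed directly from the sample covariance matrices. A NumPy sketch follows (the package itself is R; the function below is illustrative). It uses the singular value decomposition of Σx^{-1/2} Σxy Σy^{-1/2}, whose left and right singular vectors are the ψi and φi of the two eigenproblems and whose singular values are the canonical correlations:

```python
import numpy as np

def cca_coef(X, Y, d):
    """Sample canonical coefficient matrices Mx (p x d) and My (r x d).

    Implements a_i = Sx^{-1/2} psi_i and b_i = Sy^{-1/2} phi_i with
    population covariances replaced by their moment estimators.
    """
    n = X.shape[0]
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Sx, Sy = Xc.T @ Xc / (n - 1), Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Symmetric inverse square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Sxi, Syi = inv_sqrt(Sx), inv_sqrt(Sy)
    # SVD of Sx^{-1/2} Sxy Sy^{-1/2}: psi_i left, phi_i right singular vectors
    Psi, rho, PhiT = np.linalg.svd(Sxi @ Sxy @ Syi)
    Mx = Sxi @ Psi[:, :d]        # (a_1, ..., a_d)
    My = Syi @ PhiT[:d].T        # (b_1, ..., b_d)
    return Mx, My, rho           # rho: ordered canonical correlations
```

By construction, the canonical variates Mx^T X have unit sample variance, and the correlation of the first pair equals the leading canonical correlation.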
More detailed discussion of the seeded CCA follows in the next subsections.

Development

Denote by S(M) the subspace spanned by the columns of M ∈ R^{p×r}. Lee and Yoo (2014) show the following relation:

S(Mx) = S(Σx^{-1} Σxy) and S(My) = S(Σy^{-1} Σyx).  (2)

The relation in (2) directly indicates that Mx and My form basis matrices of S(Σx^{-1} Σxy) and S(Σy^{-1} Σyx), and hence Mx and My can be restored from Σx^{-1} Σxy and Σy^{-1} Σyx. Now, we define the following two matrices:

Rx,u1 ∈ R^{p×ru1} = (Σxy, Σx Σxy, ..., Σx^{u1−1} Σxy) and Ry,u2 ∈ R^{r×pu2} = (Σyx, Σy Σyx, ..., Σy^{u2−1} Σyx).  (3)

In Rx,u1 and Ry,u2, the numbers u1 and u2 are called termination indexes; they determine the numbers of projections of Σxy and Σyx onto Σx and Σy, respectively. Also define

Mx,u1 ∈ R^{p×r} = Rx,u1 (Rx,u1^T Σx Rx,u1)^− Rx,u1^T Σxy and My,u2 ∈ R^{r×p} = Ry,u2 (Ry,u2^T Σy Ry,u2)^− Ry,u2^T Σyx,  (4)

where A^− denotes a generalized inverse. Cook et al. (2007) show that S(Mx,u1) = S(Σx^{-1} Σxy) and S(My,u2) = S(Σy^{-1} Σyx) in (4), and hence Mx,u1 and My,u2 can be used to infer Mx and My, respectively. One clear advantage of using Mx,u1 and My,u2 is that there is no need to invert Σx and Σy. Practically, it is important to select proper values for the termination indexes u1 and u2. For this, define ∆x,u1 = Mx,u1+1 − Mx,u1 and ∆y,u2 = My,u2+1 − My,u2. Then the following measures of the increments in u1 and u2 are defined: nFx,u1 = n trace(∆x,u1^T Σx ∆x,u1) and nFy,u2 = n trace(∆y,u2^T Σy ∆y,u2). A proper value of u is one beyond which nFx,u1 and nFx,u1+1, and nFy,u2 and nFy,u2+1, show little change. The selected u1 and u2 for Mx,u1 and My,u2 need not be equal. Next, the original two sets of variables X and Y are replaced with Mx,u1^T X ∈ R^r and My,u2^T Y ∈ R^p.
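The initialized step above uses only matrix products and a generalized inverse of a small u·r × u·r matrix, never an inversion of the p × p matrix Σx. A NumPy sketch of Mx,u in (4) and the nF criterion, under the definitions just given (illustrative, not the seedCCA implementation):

```python
import numpy as np

def seed_matrix(Sx, Sxy, u):
    """M_{x,u} = R_u (R_u^T Sx R_u)^- R_u^T Sxy,
    with R_u = (Sxy, Sx Sxy, ..., Sx^{u-1} Sxy) built by iterative projection."""
    blocks, B = [], Sxy.copy()
    for _ in range(u):
        blocks.append(B)
        B = Sx @ B                      # next projection of Sxy onto Sx
    R = np.hstack(blocks)               # p x (r*u) seed matrix
    # pinv supplies the generalized inverse; no p x p inversion occurs
    return R @ np.linalg.pinv(R.T @ Sx @ R) @ R.T @ Sxy

def nF_increments(Sx, Sxy, n, max_u=10):
    """nF_{x,u} = n * trace(Delta_u^T Sx Delta_u) for u = 1, ..., max_u;
    choose u where consecutive values show little change."""
    out = []
    for u in range(1, max_u + 1):
        D = seed_matrix(Sx, Sxy, u + 1) - seed_matrix(Sx, Sxy, u)
        out.append(n * np.trace(D.T @ Sx @ D))
    return out
```

As a sanity check on the construction: when Σx = I_p, every projected block equals Σxy, so Mx,u reduces to Σxy for all u and the increments nFx,u are all zero.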
This reduction of X and Y causes no loss of information for the CCA in the sense that S(Mx,u1) = S(Mx) and S(My,u2) = S(My), and it is called the initialized CCA. The initialized CCA has the following two cases.

Case 1: Suppose that min(p, r) = r << n. Then the original X alone is replaced with Mx,u1^T X, and the original Y is kept.

Case 2: If min(p, r) = r is not fairly smaller than n, Σxy and Σyx