F. Greselin, T. B. Murphy, G. C. Porzio, D. Vistocco
This special issue of Statistical Analysis and Data Mining collects papers presented at the 12th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS), held in Cassino, Italy, 11–13 September 2019. The CLADAG group, founded in 1997, promotes advanced methodological research in multivariate statistics with a special vocation for data analysis and classification. CLADAG is a member of the International Federation of Classification Societies (IFCS). It organizes a biennial international scientific meeting and schools related to classification and data analysis, publishes a newsletter, and cooperates with the other member societies of the IFCS in the organization of their conferences. Founded in 1985, the IFCS is a federation of national, regional, and linguistically based classification societies aimed at promoting classification research. Previous CLADAG meetings were held in Pescara (1997), Roma (1999), Palermo (2001), Bologna (2003), Parma (2005), Macerata (2007), Catania (2009), Pavia (2011), Modena and Reggio Emilia (2013), Cagliari (2015), and Milano (2017). The best papers from the conference were submitted to this special issue, and six of them have been selected for publication following a blind peer-review process. The manuscripts deal with different data analysis issues: mixtures of distributions, compositional data analysis, Markov chains for web usability, survival analysis, and applications to high-throughput, eye-tracking, and insurance transaction data. The paper by Jirí Dvorák et al. (available in Stat Anal Data Min: The ASA Data Sci Journal. 2020;13:548–564) introduces the Clover plot, an easy-to-understand graphical tool that facilitates the appropriate choice of a classifier to be employed in supervised classification. It combines four complementary displays—the depth–depth plot, the bagdistance plot, an approach based on illumination, and the classical diagnostic plot based on Mahalanobis distances. It borrows strengths from all these methodologies, contrasts them, and allows interpretations about the structure of the data. The paper by S.X. Lee et al. proposes a parallelization strategy for the Expectation–Maximization (EM) algorithm, with a special focus on the estimation of finite mixtures of flexible distributions such as the canonical fundamental skew t distribution (CFUST). The parallel implementation of the EM algorithm is suitable for single-threaded and multi-threaded processors as well as for single-machine and multiple-node systems. The EM algorithm is also discussed in the paper by L. Scrucca. Here, a fast and efficient Modal EM algorithm is provided for identifying the modes of a density estimated through a finite mixture of Gaussian distributions with parsimonious component covariance structures. The proposed approach is based on an iterative procedure aimed at identifying the local maxima, exploiting features of the underlying Gaussian mixture model. Motivated by applications to high-throughput compositional data analysis, the paper by N. Štefelová et al. proposes a data-driven weighting strategy to enhance marker identification in PLS regression with compositional predictors. The weighting strategy exploits the correlation structure between the response variable and the pairwise log-ratios. Its practical relevance is illustrated through the analysis of metabolomic signals related to greenhouse gas emissions from cattle. The paper by G. Zammarchi et al. exploits Markov chains to analyze the web usability of a university website studied with an eye-tracking approach. With the aim of improving usability, the paper compares the performance of high school and university students on ten different tasks in terms of time to completion, number of fixations, and difficulty ratio. Finally, D. Zapletal exploits data from a commercial insurance company in the Czech Republic to compare the effectiveness of some survival analysis models within an insurance transaction framework. The ability of the Cox proportional hazards model and of some competing risks models (namely, cause-specific hazards and subdistribution hazards models) to identify relevant explanatory variables is assessed on a large dataset comprising more than 200,000 individuals. In conclusion, this special issue meets the CLADAG goal of supporting the exchange of ideas on classification and data analysis, and we firmly believe it well represents the scientific character of the meeting.
{"title":"CLADAG 2019 Special Issue: Selected Papers on Classification and Data Analysis","authors":"F. Greselin, T. B. Murphy, G. C. Porzio, D. Vistocco","doi":"10.1002/sam.11533","DOIUrl":"https://doi.org/10.1002/sam.11533","url":null,"abstract":"This special issue of Statistical Analysis and Data Mining collects papers presented at the 12th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS), held in Cassino, Italy, 11–13 September 2019. The CLADAG group, founded in 1997, promotes advanced methodological research in multivariate statistics with a special vocation in Data Analysis and Classification. CLADAG is a member of the International Federation of Classification Societies (IFCS). It organizes a biennial international scientific meeting, schools related to classification and data analysis, publishes a newsletter, and cooperates with other member societies of the IFCS to the organization of their conferences. Founded in 1985, the IFCS is a federation of national, regional, and linguistically-based classification societies aimed at promoting classification research. Previous CLADAG meetings were held in Pescara (1997), Roma (1999), Palermo (2001), Bologna (2003), Parma (2005), Macerata (2007), Catania (2009), Pavia (2011), Modena and Reggio Emilia (2013), Cagliari (2015), and Milano (2017). Best papers from the conference have been submitted to this special issue, and six of them have been selected for publication, following a blind peer-review process. The manuscripts deal with different data analysis issues: mixture of distributions, compositional data analysis, Markov chain for web usability, survival analysis, and applications to high-throughput, eye-tracking, and insurance transaction data. The paper by Jirí Dvorák et al. (available in Stat Anal Data Min: The ASA Data Sci Journal. 2020;13:548–564) introduces the Clover plot, an easy-to-understand graphical tool that facilitates the appropriate choice of a classifier, to be employed in supervised classification. It combines four complementary classifiers—the depth–depth plot, the bagdistance plot, an approach based on the illumination, and the classical diagnostic plot based on Mahalanobis distances. It borrows strengths from all these methodologies, contrasts them, and allows interpretations about the structure of the data. The paper by S.X. Lee et al. proposes a parallelization strategy of the Expectation–Maximization (EM) algorithm, with a special focus on the estimation of finite mixtures of flexible distribution such as the canonical fundamental skew t distribution (CFUST). The parallel implementation of the EM-algorithm is suitable for single-threaded and multi-threaded processors as well as for single machine and multiple-node systems. The EM algorithm is also discussed in the paper of L. Scrucca. Here, a fast and efficient Modal EM algorithm for identifying the modes of a density estimated through a finite mixture of Gaussian distributions with parsimonious component covariance structures is provided. The proposed approach is based on an iterative procedure aimed at identifying the local maxima, exploiting features of the underlying Gaussian mixture model. 
Motiv","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115314784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A commercial insurance company in the Czech Republic provided data on critical illness insurance. Survival analysis was used to study the influence of an insured person's gender, the age at which the person entered into the insurance contract, and the region where the person lived on the occurrence of an insured event. The main goal of the research was to investigate whether the influence of the explanatory variables is estimated differently when two different approaches to the analysis are used. The two approaches were (1) the Cox proportional hazards model, which does not assign a specific cause, such as a particular diagnosis, to a critical illness insured event, and (2) competing risks models. Regression models related to these approaches were estimated in R. The results, which are discussed and compared in the paper, show that insurance companies might benefit from offering policies that consider specific diagnoses as the cause of insured events. They also show that, in addition to age, the gender of the client plays a key role in the occurrence of such insured events.
{"title":"Application of the Cox proportional hazards model and competing risks models to critical illness insurance data","authors":"David Zapletal","doi":"10.1002/sam.11532","DOIUrl":"https://doi.org/10.1002/sam.11532","url":null,"abstract":"A commercial insurance company in the Czech Republic provided data on critical illness insurance. The survival analysis was used to study the influence of the gender of an insured person, the age at which the person entered into an insurance contract, and the region where the insured person lived on the occurrence of an insured event. The main goal of the research was to investigate whether the influence of explanatory variables is estimated differently when two different approaches of analysis are used. The two approaches used were (1) the Cox proportional hazard model that does not assign a specific cause, such as a certain diagnosis, to a critical illness insured event and (2) the competing risks models. Regression models related to these approaches were estimated by R software. The results, which are discussed and compared in the paper, show that insurance companies might benefit from offering policies that consider specific diagnoses as the cause of insured events. They also show that in addition to age, the gender of the client plays a key role in the occurrence of such insured events.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124674974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical and k-medoids clustering are deterministic clustering algorithms defined on pairwise distances. We use these same pairwise distances in a novel stochastic clustering procedure based on a probability distribution. We call our proposed method CaviarPD, a portmanteau of cluster analysis via random partition distributions. CaviarPD first samples clusterings from a distribution on partitions and then finds the best cluster estimate based on these samples using algorithms that minimize an expected loss. Using eight case studies, we show that our approach produces results as close to the truth as hierarchical and k-medoids methods, with the additional advantage of a probabilistic framework for assessing clustering uncertainty. The method provides an intuitive graphical representation of clustering uncertainty through pairwise probabilities from partition samples. A software implementation of the method is available in the CaviarPD package for R.
{"title":"Cluster analysis via random partition distributions","authors":"D. B. Dahl, J. Andros, J. Carter","doi":"10.1002/sam.11602","DOIUrl":"https://doi.org/10.1002/sam.11602","url":null,"abstract":"Hierarchical and k‐medoids clustering are deterministic clustering algorithms defined on pairwise distances. We use these same pairwise distances in a novel stochastic clustering procedure based on a probability distribution. We call our proposed method CaviarPD, a portmanteau from cluster analysis via random partition distributions. CaviarPD first samples clusterings from a distribution on partitions and then finds the best cluster estimate based on these samples using algorithms to minimize an expected loss. Using eight case studies, we show that our approach produces results as close to the truth as hierarchical and k‐medoids methods, and has the additional advantage of allowing for a probabilistic framework to assess clustering uncertainty. The method provides an intuitive graphical representation of clustering uncertainty through pairwise probabilities from partition samples. A software implementation of the method is available in the CaviarPD package for R.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121976658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Finite mixture models are powerful tools for modeling and analyzing heterogeneous data. Parameter estimation is typically carried out using maximum likelihood estimation via the Expectation–Maximization (EM) algorithm. Recently, the adoption of flexible distributions as component densities has become increasingly popular. Often, the EM algorithm for these models involves complicated expressions that are time‐consuming to evaluate numerically. In this paper, we describe a parallel implementation of the EM algorithm suitable for both single‐threaded and multi‐threaded processors and for both single machine and multiple‐node systems. Numerical experiments are performed to demonstrate the potential performance gain in different settings. Comparison is also made across two commonly used platforms—R and MATLAB. For illustration, a fairly general mixture model is used in the comparison.
{"title":"Multi‐node Expectation–Maximization algorithm for finite mixture models","authors":"Sharon X. Lee, G. McLachlan, Kaleb L. Leemaqz","doi":"10.1002/sam.11529","DOIUrl":"https://doi.org/10.1002/sam.11529","url":null,"abstract":"Finite mixture models are powerful tools for modeling and analyzing heterogeneous data. Parameter estimation is typically carried out using maximum likelihood estimation via the Expectation–Maximization (EM) algorithm. Recently, the adoption of flexible distributions as component densities has become increasingly popular. Often, the EM algorithm for these models involves complicated expressions that are time‐consuming to evaluate numerically. In this paper, we describe a parallel implementation of the EM algorithm suitable for both single‐threaded and multi‐threaded processors and for both single machine and multiple‐node systems. Numerical experiments are performed to demonstrate the potential performance gain in different settings. Comparison is also made across two commonly used platforms—R and MATLAB. For illustration, a fairly general mixture model is used in the comparison.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132083052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose tractable symmetric exponential families of distributions for multivariate vectors of 0's and 1's in p dimensions, or what are referred to in this paper as binary vectors, that allow for nontrivial amounts of variation around some central value μ ∈ {0,1}^p. We note that more or less standard asymptotics provides likelihood-based inference in the one-sample problem. We then consider mixture models where the component distributions are of this form. Bayes analysis based on Dirichlet processes and Jeffreys priors for the exponential family parameters proves tractable and informative in problems where relevant distributions for a vector of binary variables are clearly not symmetric. We also extend our proposed Bayesian mixture model analysis to datasets with missing entries. Performance is illustrated through simulation studies and application to real datasets.
{"title":"Modeling and inference for mixtures of simple symmetric exponential families of p ‐dimensional distributions for vectors with binary coordinates","authors":"A. Chakraborty, S. Vardeman","doi":"10.1002/sam.11528","DOIUrl":"https://doi.org/10.1002/sam.11528","url":null,"abstract":"We propose tractable symmetric exponential families of distributions for multivariate vectors of 0's and 1's in p dimensions, or what are referred to in this paper as binary vectors, that allow for nontrivial amounts of variation around some central value μ∈{0,1}p . We note that more or less standard asymptotics provides likelihood‐based inference in the one‐sample problem. We then consider mixture models where component distributions are of this form. Bayes analysis based on Dirichlet processes and Jeffreys priors for the exponential family parameters prove tractable and informative in problems where relevant distributions for a vector of binary variables are clearly not symmetric. We also extend our proposed Bayesian mixture model analysis to datasets with missing entries. Performance is illustrated through simulation studies and application to real datasets.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131970853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the article “Data-driven dimension reduction in functional principal component analysis identifying the change-point in functional data,” published in Statistical Analysis and Data Mining: The ASA Data Science Journal, Vol. 13, No. 6, p. 535, the following sentence has been added to the Acknowledgements section after the first online publication: “The research of the third author Mr. Arjun Lakra is supported by a grant from Council of Scientific and Industrial Research (CSIR Award No.: 09/081(1350)/2019-EMR-I), Government of India.” We apologize for this error.
{"title":"Erratum to “Data‐driven dimension reduction in functional principal component analysis identifying the change‐point in functional data”","authors":"","doi":"10.1002/sam.11510","DOIUrl":"https://doi.org/10.1002/sam.11510","url":null,"abstract":"In the article “Data-driven dimension reduction in functional principal component analysis identifying the change-point in functional data” published in the Statistical Analysis and Data Mining: The ASA Data Science Journal Vol. 13, No. 6, p. 535, the following sentence is added in the Acknowledgements section after the first online publication. “The research of the third author Mr. Arjun Lakra is supported by a grant from Council of Scientific and Industrial Research (CSIR Award No.: 09/081(1350)/2019-EMR-I), Government of India.” We apologize for this error.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133109731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amanda Muyskens, Kathleen L. Schmidt, Matthew D. Nelms, N. Barton, J. Florando, A. Kupresanin, David Rivera
In regimes of high strain rate, the strength of materials often cannot be measured directly in experiments. Instead, the strength is inferred from an experimental observable, such as a change in shape, that is matched by simulations supported by a known strength model. In hole closure experiments, the rate and degree to which a central hole in a plate of material closes during a dynamic loading event are used to infer material strength parameters. Due to the complexity of the experiment, many computationally expensive, three-dimensional simulations are necessary to train an emulator for calibration or other analyses. These simulations can be run at multiple grid resolutions, where denser grids are slower but more accurate. To reduce the computational cost, simulations at different resolutions can be combined to develop an accurate emulator within a limited training time. We explore the novel design and construction of an appropriate functional recursive multi-fidelity emulator of a strength model for tantalum in hole closure experiments that can be applied to arbitrarily large training data. Hence, by formulating a multi-fidelity model that employs low-fidelity simulations, we were able to reduce the error of our emulator by approximately 81% with only an approximately 1.6% increase in computing resource utilization.
{"title":"A practical extension of the recursive multi‐fidelity model for the emulation of hole closure experiments","authors":"Amanda Muyskens, Kathleen L. Schmidt, Matthew D. Nelms, N. Barton, J. Florando, A. Kupresanin, David Rivera","doi":"10.1002/sam.11513","DOIUrl":"https://doi.org/10.1002/sam.11513","url":null,"abstract":"In regimes of high strain rate, the strength of materials often cannot be measured directly in experiments. Instead, the strength is inferred based on an experimental observable, such as a change in shape, that is matched by simulations supported by a known strength model. In hole closure experiments, the rate and degree to which a central hole in a plate of material closes during a dynamic loading event are used to infer material strength parameters. Due to the complexity of the experiment, many computationally expensive, three‐dimensional simulations are necessary to train an emulator for calibration or other analyses. These simulations can be run at multiple grid resolutions, where dense grids are slower but more accurate. In an effort to reduce the computational cost, a combination of simulations with different resolutions can be combined to develop an accurate emulator within a limited training time. We explore the novel design and construction of an appropriate functional recursive multi‐fidelity emulator of a strength model for tantalum in hole closure experiments that can be applied to arbitrarily large training data. Hence, by formulating a multi‐fidelity model to employ low‐fidelity simulations, we were able to reduce the error of our emulator by approximately 81% with only an approximately 1.6% increase in computing resource utilization.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116804886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-throughput data representing large mixtures of chemical or biological signals are ordinarily produced in the molecular sciences. Given a number of samples, partial least squares (PLS) regression is a well-established statistical method for investigating associations between them and any continuous response variables of interest. However, technical artifacts generally make the raw signals not directly comparable between samples. Thus, data normalization is required before any meaningful scientific information can be drawn. This often allows the processed signals to be characterized as compositional data, where the relevant information is contained in the pairwise log-ratios between the components of the mixture. The (log-ratio) pivot coordinate approach facilitates the aggregation of the pairwise log-ratios of a component to all the remaining components into single variables. This simplifies interpretability and the investigation of their relative importance but, particularly in a high-dimensional context, the aggregated log-ratios can easily mix up information from different underlying processes. In this context, we propose a weighting strategy for the construction of pivot coordinates for PLS regression which draws on the correlation between the response variable and the pairwise log-ratios. Using real and simulated data sets, we demonstrate that this proposal enhances the discovery of biological markers in high-throughput compositional data.
{"title":"Weighted pivot coordinates for partial least squares‐based marker discovery in high‐throughput compositional data","authors":"N. Štefelová, J. Palarea‐Albaladejo, K. Hron","doi":"10.1002/sam.11514","DOIUrl":"https://doi.org/10.1002/sam.11514","url":null,"abstract":"High‐throughput data representing large mixtures of chemical or biological signals are ordinarily produced in the molecular sciences. Given a number of samples, partial least squares (PLS) regression is a well‐established statistical method to investigate associations between them and any continuous response variables of interest. However, technical artifacts generally make the raw signals not directly comparable between samples. Thus, data normalization is required before any meaningful scientific information can be drawn. This often allows to characterize the processed signals as compositional data where the relevant information is contained in the pairwise log‐ratios between the components of the mixture. The (log‐ratio) pivot coordinate approach facilitates the aggregation into single variables of the pairwise log‐ratios of a component to all the remaining components. This simplifies interpretability and the investigation of their relative importance but, particularly in a high‐dimensional context, the aggregated log‐ratios can easily mix up information from different underlaying processes. In this context, we propose a weighting strategy for the construction of pivot coordinates for PLS regression which draws on the correlation between response variable and pairwise log‐ratios. Using real and simulated data sets, we demonstrate that this proposal enhances the discovery of biological markers in high‐throughput compositional data.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130172352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernard Nguyen, Leanne S. Whitmore, Anthe George, Corey M. Hudson
In-silico screening of novel biofuel molecules based on chemical and fuel properties is a critical first step in the biofuel evaluation process, due to the significant volumes of samples required for experimental testing, the destructive nature of engine tests, and the costs associated with bench-scale synthesis of novel fuels. Predictive models are limited by training sets of few existing measurements, often containing similar classes of molecules that represent just a subset of the potential molecular fuel space. Software tools can be used to generate every possible molecular descriptor for use as an input feature, but most of these features are largely irrelevant, and training models on datasets whose dimensionality exceeds their sample size tends to yield poor predictive performance. Feature selection has been shown to improve machine learning models, but correlation-based feature selection fails to provide scientific insight into the underlying mechanisms that determine structure–property relationships. The implementation of causal discovery in feature selection could potentially inform the biofuel design process while also improving model prediction accuracy and robustness to new data. In this study, we investigate the benefits causal-based feature selection might have on both model performance and the identification of key molecular substructures. We found that causal-based feature selection performed on par with alternative filter methods, and that a structural causal model provides valuable scientific insights into the relationships between molecular substructures and fuel properties.
{"title":"Evaluating causal‐based feature selection for fuel property prediction models","authors":"Bernard Nguyen, Leanne S. Whitmore, Anthe George, Corey M. Hudson","doi":"10.1002/sam.11511","DOIUrl":"https://doi.org/10.1002/sam.11511","url":null,"abstract":"In‐silico screening of novel biofuel molecules based on chemical and fuel properties is a critical first step in the biofuel evaluation process due to the significant volumes of samples required for experimental testing, the destructive nature of engine tests, and the costs associated with bench‐scale synthesis of novel fuels. Predictive models are limited by training sets of few existing measurements, often containing similar classes of molecules that represent just a subset of the potential molecular fuel space. Software tools can be used to generate every possible molecular descriptor for use as input features, but most of these features are largely irrelevant and training models on datasets with higher dimensionality than size tends to yield poor predictive performance. Feature selection has been shown to improve machine learning models, but correlation‐based feature selection fails to provide scientific insight into the underlying mechanisms that determine structure–property relationships. The implementation of causal discovery in feature selection could potentially inform the biofuel design process while also improving model prediction accuracy and robustness to new data. In this study, we investigate the benefits causal‐based feature selection might have on both model performance and identification of key molecular substructures. We found that causal‐based feature selection performed on par with alternative filtration methods, and that a structural causal model provides valuable scientific insights into the relationships between molecular substructures and fuel properties.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121057193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Web usability is a crucial feature of a website, allowing users to easily find information in a short time. Eye tracking data registered during the execution of tasks allow web usability to be measured in a more objective way than questionnaires. In this work, we evaluated the web usability of the website of the University of Cagliari through the analysis of eye tracking data with qualitative and quantitative methods. The performances of two groups of students (i.e., high school and university students) across 10 different tasks were compared in terms of time to completion, number of fixations, and difficulty ratio. Transitions between different areas of interest (AOIs) were analyzed in the two groups using Markov chains. For the majority of tasks, we did not observe significant differences in the performances of the two groups, suggesting that the information needed to complete the tasks could easily be retrieved by students with little previous experience in using the website. For a specific task, high school students showed a worse performance based on the number of fixations and a different Markov chain stationary distribution compared to university students. These results made it possible to highlight elements of the pages that can be modified to improve web usability.
{"title":"Markov chain to analyze web usability of a university website using eye tracking data","authors":"Gianpaolo Zammarchi, L. Frigau, F. Mola","doi":"10.1002/sam.11512","DOIUrl":"https://doi.org/10.1002/sam.11512","url":null,"abstract":"Web usability is a crucial feature of a website, allowing users to easily find information in a short time. Eye tracking data registered during the execution of tasks allow to measure web usability in a more objective way compared to questionnaires. In this work, we evaluated the web usability of the website of the University of Cagliari through the analysis of eye tracking data with qualitative and quantitative methods. Performances of two groups of students (i.e., high school and university students) across 10 different tasks were compared in terms of time to completion, number of fixations and difficulty ratio. Transitions between different areas of interest (AOI) were analyzed in the two groups using Markov chain. For the majority of tasks, we did not observe significant differences in the performances of the two groups, suggesting that the information needed to complete the tasks could easily be retrieved by students with little previous experience in using the website. For a specific task, high school students showed a worse performance based on the number of fixations and a different Markov chain stationary distribution compared to university students. These results allowed to highlight elements of the pages that can be modified to improve web usability.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130065955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}