F. Greselin, T. B. Murphy, G. C. Porzio, D. Vistocco
This special issue of Statistical Analysis and Data Mining collects papers presented at the 12th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS), held in Cassino, Italy, 11–13 September 2019. The CLADAG group, founded in 1997, promotes advanced methodological research in multivariate statistics with a special vocation for data analysis and classification. CLADAG is a member of the International Federation of Classification Societies (IFCS). It organizes a biennial international scientific meeting and schools related to classification and data analysis, publishes a newsletter, and cooperates with the other member societies of the IFCS in the organization of their conferences. Founded in 1985, the IFCS is a federation of national, regional, and linguistically based classification societies aimed at promoting classification research. Previous CLADAG meetings were held in Pescara (1997), Roma (1999), Palermo (2001), Bologna (2003), Parma (2005), Macerata (2007), Catania (2009), Pavia (2011), Modena and Reggio Emilia (2013), Cagliari (2015), and Milano (2017). The best papers from the conference were submitted to this special issue, and six of them have been selected for publication following a blind peer-review process. The manuscripts deal with different data analysis issues: mixtures of distributions, compositional data analysis, Markov chains for web usability, survival analysis, and applications to high-throughput, eye-tracking, and insurance transaction data. The paper by Jirí Dvorák et al. (available in Stat Anal Data Min: The ASA Data Sci Journal. 2020;13:548–564) introduces the Clover plot, an easy-to-understand graphical tool that facilitates the appropriate choice of a classifier to be employed in supervised classification. It combines four complementary displays—the depth–depth plot, the bagdistance plot, an approach based on illumination, and the classical diagnostic plot based on Mahalanobis distances. It borrows strengths from all these methodologies, contrasts them, and allows interpretations about the structure of the data. The paper by S.X. Lee et al. proposes a parallelization strategy for the Expectation–Maximization (EM) algorithm, with a special focus on the estimation of finite mixtures of flexible distributions such as the canonical fundamental skew t distribution (CFUST). The parallel implementation of the EM algorithm is suitable for single-threaded and multi-threaded processors as well as for single-machine and multiple-node systems. The EM algorithm is also discussed in the paper by L. Scrucca. Here, a fast and efficient Modal EM algorithm is provided for identifying the modes of a density estimated through a finite mixture of Gaussian distributions with parsimonious component covariance structures. The proposed approach is based on an iterative procedure aimed at identifying the local maxima, exploiting features of the underlying Gaussian mixture model. Motivated by applications to high-throughput compositional data analysis, the paper by N. Štefelová et al. proposes a data-driven weighting strategy to enhance marker identification in PLS regression with compositional predictors. The weighting strategy exploits the correlation structure between the response variable and the pairwise log-ratios. Its practical relevance is illustrated through the analysis of metabolomic signals related to greenhouse gas emissions from cattle. The paper by G. Zammarchi et al. exploits Markov chains to analyze the web usability of a university website studied with an eye-tracking approach. With the aim of improving usability, the paper compares the performance of high school and university students on ten different tasks in terms of time to completion, number of fixations, and difficulty ratio. Finally, D. Zapletal exploits data from a commercial insurance company in the Czech Republic to compare the effectiveness of some survival analysis models within an insurance transaction framework. The ability of the Cox proportional hazards model and of some competing risks models (namely, cause-specific hazards and subdistribution hazards models) to identify relevant explanatory variables is assessed on a large dataset comprising more than 200,000 individuals. In conclusion, this special issue meets the CLADAG goal of supporting the exchange of ideas on classification and data analysis, and we firmly believe it well represents the scientific character of the meeting.
{"title":"CLADAG 2019 Special Issue: Selected Papers on Classification and Data Analysis","authors":"F. Greselin, T. B. Murphy, G. C. Porzio, D. Vistocco","doi":"10.1002/sam.11533","DOIUrl":"https://doi.org/10.1002/sam.11533","url":null,"abstract":"This special issue of Statistical Analysis and Data Mining collects papers presented at the 12th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS), held in Cassino, Italy, 11–13 September 2019. The CLADAG group, founded in 1997, promotes advanced methodological research in multivariate statistics with a special vocation in Data Analysis and Classification. CLADAG is a member of the International Federation of Classification Societies (IFCS). It organizes a biennial international scientific meeting, schools related to classification and data analysis, publishes a newsletter, and cooperates with other member societies of the IFCS to the organization of their conferences. Founded in 1985, the IFCS is a federation of national, regional, and linguistically-based classification societies aimed at promoting classification research. Previous CLADAG meetings were held in Pescara (1997), Roma (1999), Palermo (2001), Bologna (2003), Parma (2005), Macerata (2007), Catania (2009), Pavia (2011), Modena and Reggio Emilia (2013), Cagliari (2015), and Milano (2017). Best papers from the conference have been submitted to this special issue, and six of them have been selected for publication, following a blind peer-review process. The manuscripts deal with different data analysis issues: mixture of distributions, compositional data analysis, Markov chain for web usability, survival analysis, and applications to high-throughput, eye-tracking, and insurance transaction data. The paper by Jirí Dvorák et al. (available in Stat Anal Data Min: The ASA Data Sci Journal. 2020;13:548–564) introduces the Clover plot, an easy-to-understand graphical tool that facilitates the appropriate choice of a classifier, to be employed in supervised classification. It combines four complementary classifiers—the depth–depth plot, the bagdistance plot, an approach based on the illumination, and the classical diagnostic plot based on Mahalanobis distances. It borrows strengths from all these methodologies, contrasts them, and allows interpretations about the structure of the data. The paper by S.X. Lee et al. proposes a parallelization strategy of the Expectation–Maximization (EM) algorithm, with a special focus on the estimation of finite mixtures of flexible distribution such as the canonical fundamental skew t distribution (CFUST). The parallel implementation of the EM-algorithm is suitable for single-threaded and multi-threaded processors as well as for single machine and multiple-node systems. The EM algorithm is also discussed in the paper of L. Scrucca. Here, a fast and efficient Modal EM algorithm for identifying the modes of a density estimated through a finite mixture of Gaussian distributions with parsimonious component covariance structures is provided. The proposed approach is based on an iterative procedure aimed at identifying the local maxima, exploiting features of the underlying Gaussian mixture model. 
Motiv","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115314784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A commercial insurance company in the Czech Republic provided data on critical illness insurance. Survival analysis was used to study the influence of an insured person's gender, the age at which the person entered into the insurance contract, and the region where the person lived on the occurrence of an insured event. The main goal of the research was to investigate whether the influence of the explanatory variables is estimated differently when two different approaches to the analysis are used. The two approaches were (1) the Cox proportional hazards model, which does not assign a specific cause, such as a particular diagnosis, to a critical illness insured event, and (2) competing risks models. Regression models related to these approaches were estimated in R. The results, which are discussed and compared in the paper, show that insurance companies might benefit from offering policies that consider specific diagnoses as the cause of insured events. They also show that, in addition to age, the gender of the client plays a key role in the occurrence of such insured events.
{"title":"Application of the Cox proportional hazards model and competing risks models to critical illness insurance data","authors":"David Zapletal","doi":"10.1002/sam.11532","DOIUrl":"https://doi.org/10.1002/sam.11532","url":null,"abstract":"A commercial insurance company in the Czech Republic provided data on critical illness insurance. The survival analysis was used to study the influence of the gender of an insured person, the age at which the person entered into an insurance contract, and the region where the insured person lived on the occurrence of an insured event. The main goal of the research was to investigate whether the influence of explanatory variables is estimated differently when two different approaches of analysis are used. The two approaches used were (1) the Cox proportional hazard model that does not assign a specific cause, such as a certain diagnosis, to a critical illness insured event and (2) the competing risks models. Regression models related to these approaches were estimated by R software. The results, which are discussed and compared in the paper, show that insurance companies might benefit from offering policies that consider specific diagnoses as the cause of insured events. They also show that in addition to age, the gender of the client plays a key role in the occurrence of such insured events.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124674974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical and k-medoids clustering are deterministic clustering algorithms defined on pairwise distances. We use these same pairwise distances in a novel stochastic clustering procedure based on a probability distribution. We call our proposed method CaviarPD, a portmanteau of cluster analysis via random partition distributions. CaviarPD first samples clusterings from a distribution on partitions and then finds the best cluster estimate based on these samples using algorithms that minimize an expected loss. Using eight case studies, we show that our approach produces results as close to the truth as hierarchical and k-medoids methods, with the additional advantage of a probabilistic framework for assessing clustering uncertainty. The method provides an intuitive graphical representation of clustering uncertainty through pairwise probabilities from partition samples. A software implementation of the method is available in the CaviarPD package for R.
{"title":"Cluster analysis via random partition distributions","authors":"D. B. Dahl, J. Andros, J. Carter","doi":"10.1002/sam.11602","DOIUrl":"https://doi.org/10.1002/sam.11602","url":null,"abstract":"Hierarchical and k‐medoids clustering are deterministic clustering algorithms defined on pairwise distances. We use these same pairwise distances in a novel stochastic clustering procedure based on a probability distribution. We call our proposed method CaviarPD, a portmanteau from cluster analysis via random partition distributions. CaviarPD first samples clusterings from a distribution on partitions and then finds the best cluster estimate based on these samples using algorithms to minimize an expected loss. Using eight case studies, we show that our approach produces results as close to the truth as hierarchical and k‐medoids methods, and has the additional advantage of allowing for a probabilistic framework to assess clustering uncertainty. The method provides an intuitive graphical representation of clustering uncertainty through pairwise probabilities from partition samples. A software implementation of the method is available in the CaviarPD package for R.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121976658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Finite mixture models are powerful tools for modeling and analyzing heterogeneous data. Parameter estimation is typically carried out using maximum likelihood estimation via the Expectation–Maximization (EM) algorithm. Recently, the adoption of flexible distributions as component densities has become increasingly popular. Often, the EM algorithm for these models involves complicated expressions that are time‐consuming to evaluate numerically. In this paper, we describe a parallel implementation of the EM algorithm suitable for both single‐threaded and multi‐threaded processors and for both single machine and multiple‐node systems. Numerical experiments are performed to demonstrate the potential performance gain in different settings. Comparison is also made across two commonly used platforms—R and MATLAB. For illustration, a fairly general mixture model is used in the comparison.
{"title":"Multi‐node Expectation–Maximization algorithm for finite mixture models","authors":"Sharon X. Lee, G. McLachlan, Kaleb L. Leemaqz","doi":"10.1002/sam.11529","DOIUrl":"https://doi.org/10.1002/sam.11529","url":null,"abstract":"Finite mixture models are powerful tools for modeling and analyzing heterogeneous data. Parameter estimation is typically carried out using maximum likelihood estimation via the Expectation–Maximization (EM) algorithm. Recently, the adoption of flexible distributions as component densities has become increasingly popular. Often, the EM algorithm for these models involves complicated expressions that are time‐consuming to evaluate numerically. In this paper, we describe a parallel implementation of the EM algorithm suitable for both single‐threaded and multi‐threaded processors and for both single machine and multiple‐node systems. Numerical experiments are performed to demonstrate the potential performance gain in different settings. Comparison is also made across two commonly used platforms—R and MATLAB. For illustration, a fairly general mixture model is used in the comparison.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132083052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose tractable symmetric exponential families of distributions for multivariate vectors of 0's and 1's in p dimensions, or what are referred to in this paper as binary vectors, that allow for nontrivial amounts of variation around some central value μ ∈ {0,1}^p. We note that more or less standard asymptotics provides likelihood-based inference in the one-sample problem. We then consider mixture models where the component distributions are of this form. Bayes analysis based on Dirichlet processes and Jeffreys priors for the exponential family parameters proves tractable and informative in problems where relevant distributions for a vector of binary variables are clearly not symmetric. We also extend our proposed Bayesian mixture model analysis to datasets with missing entries. Performance is illustrated through simulation studies and application to real datasets.
{"title":"Modeling and inference for mixtures of simple symmetric exponential families of p ‐dimensional distributions for vectors with binary coordinates","authors":"A. Chakraborty, S. Vardeman","doi":"10.1002/sam.11528","DOIUrl":"https://doi.org/10.1002/sam.11528","url":null,"abstract":"We propose tractable symmetric exponential families of distributions for multivariate vectors of 0's and 1's in p dimensions, or what are referred to in this paper as binary vectors, that allow for nontrivial amounts of variation around some central value μ∈{0,1}p . We note that more or less standard asymptotics provides likelihood‐based inference in the one‐sample problem. We then consider mixture models where component distributions are of this form. Bayes analysis based on Dirichlet processes and Jeffreys priors for the exponential family parameters prove tractable and informative in problems where relevant distributions for a vector of binary variables are clearly not symmetric. We also extend our proposed Bayesian mixture model analysis to datasets with missing entries. Performance is illustrated through simulation studies and application to real datasets.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131970853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the article “Data-driven dimension reduction in functional principal component analysis identifying the change-point in functional data,” published in Statistical Analysis and Data Mining: The ASA Data Science Journal, Vol. 13, No. 6, p. 535, the following sentence has been added to the Acknowledgements section after the first online publication: “The research of the third author Mr. Arjun Lakra is supported by a grant from Council of Scientific and Industrial Research (CSIR Award No.: 09/081(1350)/2019-EMR-I), Government of India.” We apologize for this error.
{"title":"Erratum to “Data‐driven dimension reduction in functional principal component analysis identifying the change‐point in functional data”","authors":"","doi":"10.1002/sam.11510","DOIUrl":"https://doi.org/10.1002/sam.11510","url":null,"abstract":"In the article “Data-driven dimension reduction in functional principal component analysis identifying the change-point in functional data” published in the Statistical Analysis and Data Mining: The ASA Data Science Journal Vol. 13, No. 6, p. 535, the following sentence is added in the Acknowledgements section after the first online publication. “The research of the third author Mr. Arjun Lakra is supported by a grant from Council of Scientific and Industrial Research (CSIR Award No.: 09/081(1350)/2019-EMR-I), Government of India.” We apologize for this error.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133109731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amanda Muyskens, Kathleen L. Schmidt, Matthew D. Nelms, N. Barton, J. Florando, A. Kupresanin, David Rivera
In regimes of high strain rate, the strength of materials often cannot be measured directly in experiments. Instead, the strength is inferred from an experimental observable, such as a change in shape, that is matched by simulations supported by a known strength model. In hole closure experiments, the rate and degree to which a central hole in a plate of material closes during a dynamic loading event are used to infer material strength parameters. Due to the complexity of the experiment, many computationally expensive, three-dimensional simulations are necessary to train an emulator for calibration or other analyses. These simulations can be run at multiple grid resolutions, where denser grids are slower but more accurate. To reduce the computational cost, simulations at different resolutions can be combined to develop an accurate emulator within a limited training time. We explore the novel design and construction of an appropriate functional recursive multi-fidelity emulator of a strength model for tantalum in hole closure experiments that can be applied to arbitrarily large training data. Hence, by formulating a multi-fidelity model that employs low-fidelity simulations, we were able to reduce the error of our emulator by approximately 81% with only an approximately 1.6% increase in computing resource utilization.
{"title":"A practical extension of the recursive multi‐fidelity model for the emulation of hole closure experiments","authors":"Amanda Muyskens, Kathleen L. Schmidt, Matthew D. Nelms, N. Barton, J. Florando, A. Kupresanin, David Rivera","doi":"10.1002/sam.11513","DOIUrl":"https://doi.org/10.1002/sam.11513","url":null,"abstract":"In regimes of high strain rate, the strength of materials often cannot be measured directly in experiments. Instead, the strength is inferred based on an experimental observable, such as a change in shape, that is matched by simulations supported by a known strength model. In hole closure experiments, the rate and degree to which a central hole in a plate of material closes during a dynamic loading event are used to infer material strength parameters. Due to the complexity of the experiment, many computationally expensive, three‐dimensional simulations are necessary to train an emulator for calibration or other analyses. These simulations can be run at multiple grid resolutions, where dense grids are slower but more accurate. In an effort to reduce the computational cost, a combination of simulations with different resolutions can be combined to develop an accurate emulator within a limited training time. We explore the novel design and construction of an appropriate functional recursive multi‐fidelity emulator of a strength model for tantalum in hole closure experiments that can be applied to arbitrarily large training data. Hence, by formulating a multi‐fidelity model to employ low‐fidelity simulations, we were able to reduce the error of our emulator by approximately 81% with only an approximately 1.6% increase in computing resource utilization.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116804886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-throughput data representing large mixtures of chemical or biological signals are ordinarily produced in the molecular sciences. Given a number of samples, partial least squares (PLS) regression is a well-established statistical method for investigating associations between them and any continuous response variables of interest. However, technical artifacts generally make the raw signals not directly comparable between samples. Thus, data normalization is required before any meaningful scientific information can be drawn. This often allows the processed signals to be characterized as compositional data, where the relevant information is contained in the pairwise log-ratios between the components of the mixture. The (log-ratio) pivot coordinate approach facilitates the aggregation of the pairwise log-ratios of a component to all the remaining components into single variables. This simplifies interpretability and the investigation of their relative importance but, particularly in a high-dimensional context, the aggregated log-ratios can easily mix up information from different underlying processes. In this context, we propose a weighting strategy for the construction of pivot coordinates for PLS regression which draws on the correlation between the response variable and the pairwise log-ratios. Using real and simulated data sets, we demonstrate that this proposal enhances the discovery of biological markers in high-throughput compositional data.
{"title":"Weighted pivot coordinates for partial least squares‐based marker discovery in high‐throughput compositional data","authors":"N. Štefelová, J. Palarea‐Albaladejo, K. Hron","doi":"10.1002/sam.11514","DOIUrl":"https://doi.org/10.1002/sam.11514","url":null,"abstract":"High‐throughput data representing large mixtures of chemical or biological signals are ordinarily produced in the molecular sciences. Given a number of samples, partial least squares (PLS) regression is a well‐established statistical method to investigate associations between them and any continuous response variables of interest. However, technical artifacts generally make the raw signals not directly comparable between samples. Thus, data normalization is required before any meaningful scientific information can be drawn. This often allows to characterize the processed signals as compositional data where the relevant information is contained in the pairwise log‐ratios between the components of the mixture. The (log‐ratio) pivot coordinate approach facilitates the aggregation into single variables of the pairwise log‐ratios of a component to all the remaining components. This simplifies interpretability and the investigation of their relative importance but, particularly in a high‐dimensional context, the aggregated log‐ratios can easily mix up information from different underlaying processes. In this context, we propose a weighting strategy for the construction of pivot coordinates for PLS regression which draws on the correlation between response variable and pairwise log‐ratios. Using real and simulated data sets, we demonstrate that this proposal enhances the discovery of biological markers in high‐throughput compositional data.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130172352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernard Nguyen, Leanne S. Whitmore, Anthe George, Corey M. Hudson
In-silico screening of novel biofuel molecules based on chemical and fuel properties is a critical first step in the biofuel evaluation process, due to the significant volumes of samples required for experimental testing, the destructive nature of engine tests, and the costs associated with bench-scale synthesis of novel fuels. Predictive models are limited by training sets of few existing measurements, often containing similar classes of molecules that represent just a subset of the potential molecular fuel space. Software tools can be used to generate every possible molecular descriptor for use as an input feature, but most of these features are largely irrelevant, and training models on datasets whose dimensionality exceeds their sample size tends to yield poor predictive performance. Feature selection has been shown to improve machine learning models, but correlation-based feature selection fails to provide scientific insight into the underlying mechanisms that determine structure–property relationships. The implementation of causal discovery in feature selection could potentially inform the biofuel design process while also improving model prediction accuracy and robustness to new data. In this study, we investigate the benefits causal-based feature selection might have on both model performance and the identification of key molecular substructures. We found that causal-based feature selection performed on par with alternative filter methods, and that a structural causal model provides valuable scientific insights into the relationships between molecular substructures and fuel properties.
{"title":"Evaluating causal‐based feature selection for fuel property prediction models","authors":"Bernard Nguyen, Leanne S. Whitmore, Anthe George, Corey M. Hudson","doi":"10.1002/sam.11511","DOIUrl":"https://doi.org/10.1002/sam.11511","url":null,"abstract":"In‐silico screening of novel biofuel molecules based on chemical and fuel properties is a critical first step in the biofuel evaluation process due to the significant volumes of samples required for experimental testing, the destructive nature of engine tests, and the costs associated with bench‐scale synthesis of novel fuels. Predictive models are limited by training sets of few existing measurements, often containing similar classes of molecules that represent just a subset of the potential molecular fuel space. Software tools can be used to generate every possible molecular descriptor for use as input features, but most of these features are largely irrelevant and training models on datasets with higher dimensionality than size tends to yield poor predictive performance. Feature selection has been shown to improve machine learning models, but correlation‐based feature selection fails to provide scientific insight into the underlying mechanisms that determine structure–property relationships. The implementation of causal discovery in feature selection could potentially inform the biofuel design process while also improving model prediction accuracy and robustness to new data. In this study, we investigate the benefits causal‐based feature selection might have on both model performance and identification of key molecular substructures. We found that causal‐based feature selection performed on par with alternative filtration methods, and that a structural causal model provides valuable scientific insights into the relationships between molecular substructures and fuel properties.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121057193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Web usability is a crucial feature of a website, allowing users to easily find information in a short time. Eye tracking data registered during the execution of tasks allow web usability to be measured in a more objective way than questionnaires. In this work, we evaluated the web usability of the website of the University of Cagliari through the analysis of eye tracking data with qualitative and quantitative methods. The performances of two groups of students (i.e., high school and university students) across 10 different tasks were compared in terms of time to completion, number of fixations, and difficulty ratio. Transitions between different areas of interest (AOIs) were analyzed in the two groups using Markov chains. For the majority of tasks, we did not observe significant differences in the performances of the two groups, suggesting that the information needed to complete the tasks could easily be retrieved by students with little previous experience in using the website. For a specific task, high school students showed a worse performance based on the number of fixations and a different Markov chain stationary distribution compared to university students. These results made it possible to highlight elements of the pages that can be modified to improve web usability.
{"title":"Markov chain to analyze web usability of a university website using eye tracking data","authors":"Gianpaolo Zammarchi, L. Frigau, F. Mola","doi":"10.1002/sam.11512","DOIUrl":"https://doi.org/10.1002/sam.11512","url":null,"abstract":"Web usability is a crucial feature of a website, allowing users to easily find information in a short time. Eye tracking data registered during the execution of tasks allow to measure web usability in a more objective way compared to questionnaires. In this work, we evaluated the web usability of the website of the University of Cagliari through the analysis of eye tracking data with qualitative and quantitative methods. Performances of two groups of students (i.e., high school and university students) across 10 different tasks were compared in terms of time to completion, number of fixations and difficulty ratio. Transitions between different areas of interest (AOI) were analyzed in the two groups using Markov chain. For the majority of tasks, we did not observe significant differences in the performances of the two groups, suggesting that the information needed to complete the tasks could easily be retrieved by students with little previous experience in using the website. For a specific task, high school students showed a worse performance based on the number of fixations and a different Markov chain stationary distribution compared to university students. These results allowed to highlight elements of the pages that can be modified to improve web usability.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130065955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}