Latest publications in Statistical Analysis and Data Mining

A study of the impact of COVID-19 on the Chinese stock market based on a new textual multiple ARMA model.
IF 1.3 · CAS Tier 4 (Mathematics) · Q2 Mathematics · Pub Date: 2022-04-04 · DOI: 10.1002/sam.11582
Weijun Xu, Zhineng Fu, Hongyi Li, Jinglong Huang, Weidong Xu, Yiyang Luo

Coronavirus disease 2019 (COVID-19) caused violent fluctuations in stock markets and led to heated discussion in stock forums. The rise and fall of any specific stock is influenced by many other stocks and by the emotions expressed in forum discussions. Considering the transmission effect of emotions, we propose a new Textual Multiple Auto-Regressive Moving Average (TM-ARMA) model to study the impact of COVID-19 on the Chinese stock market. The TM-ARMA model contains a new cross-textual term and a new cross-autoregressive (AR) term that measure the cross impacts of textual emotions and price fluctuations, respectively, and the adjacency matrix that measures the relationships among stocks is updated dynamically. We compute textual sentiment scores with an emotion-dictionary-based method and estimate the parameter matrices by maximum likelihood. Our dataset includes textual posts from the Eastmoney Stock Forum and price data for the constituent stocks of the FTSE China A50 Index. We adopt a sliding-window online forecasting approach to simulate real trading situations. The results show that TM-ARMA performs very well even after the outbreak of COVID-19.
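
The abstract names an emotion-dictionary-based scoring step without giving details. Below is a minimal sketch of the generic technique, assuming a simple count-based score in [-1, 1]; the word lists and the normalization are illustrative placeholders, not the authors' dictionary.

```python
# Generic dictionary-based sentiment scoring: count positive and negative
# dictionary hits in a tokenized forum post and normalize to [-1, 1].
# POSITIVE/NEGATIVE entries are hypothetical stand-ins for a real
# financial emotion dictionary.
POSITIVE = {"涨", "利好", "买入", "看多"}
NEGATIVE = {"跌", "利空", "卖出", "看空"}

def sentiment_score(tokens):
    """Return (pos - neg) / (pos + neg), or 0.0 for a neutral post."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)
```

Per-stock scores aggregated over a trading day would then feed the textual terms of the model.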

Citations: 0
Sample Selection Bias in Evaluation of Prediction Performance of Causal Models.
IF 1.3 · CAS Tier 4 (Mathematics) · Q2 Mathematics · Pub Date: 2022-02-01 · DOI: 10.1002/sam.11559
James P Long, Min Jin Ha

Causal models are notoriously difficult to validate because they make untestable assumptions regarding confounding. New scientific experiments offer the possibility of evaluating causal models using prediction performance. Prediction performance measures are typically robust to violations of causal assumptions. However, prediction performance does depend on the selection of training and test sets. In particular, biased training sets can lead to optimistic assessments of model performance. In this work, we revisit the prediction performance of several recently proposed causal models tested on the genetic perturbation data set of Kemmeren [5]. We find that sample selection bias is likely a key driver of model performance. We propose using a less-biased evaluation set for assessing prediction performance and compare models on this new set. In this setting, the causal models have similar or worse performance compared with standard association-based estimators such as the Lasso. Finally, we compare the performance of causal estimators in simulation studies that reproduce the Kemmeren structure of genetic knockout experiments but without any sample selection bias. These results provide an improved understanding of the performance of several causal models and offer guidance on how future studies should use Kemmeren.
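
As a toy illustration of the core point, assuming nothing about the Kemmeren analysis itself: evaluating a fitted Lasso on a test set chosen to favor points the model already predicts well yields an optimistic score relative to a random test set.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n, p = 300, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                                   # sparse true signal
y = X @ beta + rng.normal(scale=3.0, size=n)

model = Lasso(alpha=0.1).fit(X[:150], y[:150])
pool_X, pool_y = X[150:], y[150:]
pred = model.predict(pool_X)

# Unbiased evaluation set: a random draw from the held-out pool.
rand_idx = rng.choice(len(pool_y), size=50, replace=False)
# Biased evaluation set: the 50 points with the smallest residuals,
# a caricature of selecting test cases favorable to the model.
biased_idx = np.argsort(np.abs(pool_y - pred))[:50]

print("random eval R^2:", r2_score(pool_y[rand_idx], pred[rand_idx]))
print("biased eval R^2:", r2_score(pool_y[biased_idx], pred[biased_idx]))
```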

Citations: 2
A framework for stability-based module detection in correlation graphs.
IF 1.3 · CAS Tier 4 (Mathematics) · Q2 Mathematics · Pub Date: 2021-04-01 · Epub Date: 2021-01-08 · DOI: 10.1002/sam.11495
Mingmei Tian, Rachael Hageman Blair, Lina Mu, Matthew Bonner, Richard Browne, Han Yu

Graphs can be used to represent the direct and indirect relationships between variables and to elucidate complex relationships and interdependencies. Detecting structure within a graph is a challenging problem; it is studied across a range of fields and is variously termed community detection, module detection, or graph partitioning. A popular class of algorithms for module detection relies on optimizing a function of modularity to identify the structure. In practice, graphs are often learned from the data and are thus prone to uncertainty. In these settings, uncertainty in the network structure can be amplified, giving unreliable estimates of the module structure. In this work, we begin to address this challenge through a nonparametric bootstrap approach to assessing the stability of module detection in a graph. Estimates of stability are presented at the level of the individual node and the inferred modules, and as an overall measure of performance for module detection in a given graph. Furthermore, bootstrap stability estimates are derived for selecting the complexity parameter that ultimately defines a graph from data in a way that optimizes stability. This approach is utilized in connection with correlation graphs but generalizes to other graphs defined through dissimilarity measures. We demonstrate our approach using a broad range of simulations and a metabolomics dataset from the Beijing Olympics Air Pollution study. These approaches are implemented in the bootcluster package, which is available in the R programming language.
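
A minimal sketch of the bootstrap-stability idea for correlation graphs, assuming a hard correlation threshold, greedy modularity communities, and an adjusted-Rand summary; the bootcluster package reports stability at the node and module level as described above, so this sketch compresses the idea to a single number.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.metrics import adjusted_rand_score

def corr_graph(X, thresh=0.5):
    """Threshold the absolute correlation matrix into a graph on features."""
    C = np.corrcoef(X, rowvar=False)
    A = (np.abs(C) > thresh).astype(int)
    np.fill_diagonal(A, 0)
    return nx.from_numpy_array(A)

def module_labels(G):
    """Flatten community sets into a label vector over nodes 0..p-1."""
    out = np.empty(G.number_of_nodes(), dtype=int)
    for k, comm in enumerate(greedy_modularity_communities(G)):
        out[list(comm)] = k
    return out

def bootstrap_stability(X, B=50, thresh=0.5, seed=0):
    """Mean ARI between modules on the full data and on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    ref = module_labels(corr_graph(X, thresh))
    scores = []
    for _ in range(B):
        Xb = X[rng.integers(0, len(X), size=len(X))]  # resample observations
        scores.append(adjusted_rand_score(ref, module_labels(corr_graph(Xb, thresh))))
    return float(np.mean(scores))
```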

Citations: 0
Unsupervised random forests.
IF 1.3 · CAS Tier 4 (Mathematics) · Q2 Mathematics · Pub Date: 2021-04-01 · Epub Date: 2021-02-05 · DOI: 10.1002/sam.11498
Alejandro Mantero, Hemant Ishwaran

sidClustering is a new random-forests unsupervised machine learning algorithm. The first step in sidClustering involves what is called sidification of the features: staggering the features so that they have mutually exclusive ranges (called the staggered interaction data [SID] main features) and then forming all pairwise interactions (called the SID interaction features). A multivariate random forest (able to handle both continuous and categorical variables) is then used to predict the SID main features. We establish uniqueness of sidification and show how multivariate impurity splitting is able to identify clusters. The proposed sidClustering method is adept at finding clusters arising from categorical and continuous variables and retains all the important advantages of random forests. The method is illustrated using simulated and real data as well as two in-depth case studies, one from a large multi-institutional study of esophageal cancer and the other involving hospital charges for cardiovascular patients.
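
A rough sketch of the sidification step under the description above: shift each feature so that the column ranges are disjoint, then form all pairwise products. The unit gap between ranges is an arbitrary choice here, and categorical features are assumed already numerically encoded.

```python
import numpy as np

def sidify(X):
    """Stagger features to mutually exclusive ranges (SID main features)
    and form all pairwise interactions (SID interaction features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    staggered = np.empty_like(X)
    offset = 0.0
    for j in range(p):
        col = X[:, j] - X[:, j].min()          # shift column to start at 0
        staggered[:, j] = col + offset
        offset += col.max() + 1.0              # gap keeps ranges disjoint
    pairs = [staggered[:, i] * staggered[:, j]
             for i in range(p) for j in range(i + 1, p)]
    interactions = np.column_stack(pairs) if pairs else np.empty((n, 0))
    return staggered, interactions
```

A multivariate random forest would then regress the staggered main features on the interaction features.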

Citations: 14
A clustering method for graphical handwriting components and statistical writership analysis.
IF 1.3 · CAS Tier 4 (Mathematics) · Q2 Mathematics · Pub Date: 2021-02-01 · Epub Date: 2020-11-24 · DOI: 10.1002/sam.11488
Amy M Crawford, Nicholas S Berry, Alicia L Carriquiry

Handwritten documents can be characterized by their content or by the shape of the written characters. We focus on the problem of comparing a person's handwriting to a document of unknown provenance using the shape of the writing, as is done in forensic applications. To do so, we first propose a method for processing scanned handwritten documents to decompose the writing into small graphical structures, often corresponding to letters. We then introduce a measure of distance between two such structures that is inspired by the graph edit distance, and a measure of center for a collection of such graphs. These measurements are the basis for an outlier-tolerant K-means algorithm that clusters the graphs based on structural attributes, thus creating a template for sorting new documents. Finally, we present a Bayesian hierarchical model to capture the propensity of a writer for producing graphs that are assigned to certain clusters. We illustrate the methods using documents from the Computer Vision Lab dataset. We show results of the identification task under the cluster assignments and compare them to the same modeling with a less flexible grouping method that is not tolerant of incidental strokes or outliers.
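
For reference, the distance the authors introduce is inspired by the graph edit distance, whose standard form is

$$ \mathrm{GED}(g_1, g_2) = \min_{(e_1, \ldots, e_k) \in \mathcal{P}(g_1, g_2)} \sum_{i=1}^{k} c(e_i), $$

where P(g1, g2) is the set of edit paths (sequences of node and edge insertions, deletions, and substitutions) transforming g1 into g2, and c(e) >= 0 is the cost of each operation. The paper's modified measure and its cost choices are not reproduced here.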

Citations: 0
Scalable network estimation with L0 penalty.
IF 1.3 · CAS Tier 4 (Mathematics) · Q2 Mathematics · Pub Date: 2021-02-01 · Epub Date: 2020-10-21 · DOI: 10.1002/sam.11483
Junghi Kim, Hongtu Zhu, Xiao Wang, Kim-Anh Do

With the advent of high-throughput sequencing, an efficient computing strategy is required to deal with large genomic data sets. The challenge of estimating a large precision matrix has garnered substantial research attention for its direct application to discriminant analyses and graphical models. Most existing methods either use a lasso-type penalty, which may lead to biased estimators, or are computationally intensive, which prevents their application to very large graphs. We propose using an L0 penalty to estimate an ultra-large precision matrix (scalnetL0). We apply scalnetL0 to RNA-seq data from breast cancer patients represented in The Cancer Genome Atlas and find improved accuracy of classifications for survival times. The estimated precision matrix provides information about a large-scale co-expression network in breast cancer. Simulation studies demonstrate that scalnetL0 provides more accurate and efficient estimators, yielding shorter CPU time and smaller Frobenius loss in sparse learning for large-scale precision matrix estimation.
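
For reference, the standard form of an L0-penalized Gaussian log-likelihood for a precision matrix is

$$ \hat{\Omega} = \underset{\Omega \succ 0}{\arg\min} \; \operatorname{tr}(S\Omega) - \log\det\Omega + \lambda \lVert \Omega \rVert_0, $$

where S is the sample covariance matrix and the L0 term counts the nonzero off-diagonal entries of Omega. Unlike the lasso's L1 term, the L0 term penalizes the number of edges rather than their magnitudes, which avoids shrinkage bias; scalnetL0's exact formulation and algorithm may differ in detail from this generic objective.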

Citations: 1
Knot selection in sparse Gaussian processes with a variational objective function.
IF 1.3 · CAS Tier 4 (Mathematics) · Q2 Mathematics · Pub Date: 2020-08-01 · Epub Date: 2020-04-20 · DOI: 10.1002/sam.11459
Nathaniel Garton, Jarad Niemi, Alicia Carriquiry

Sparse, knot-based Gaussian processes have enjoyed considerable success as scalable approximations of full Gaussian processes. Certain sparse models can be derived through specific variational approximations to the true posterior, and knots can be selected to minimize the Kullback-Leibler divergence between the approximate and true posterior. While this approach has been successful, simultaneous optimization of knots can be slow due to the number of parameters being optimized. Furthermore, few methods have been proposed for selecting the number of knots, and no experimental results exist in the literature. We propose a one-at-a-time knot selection algorithm based on Bayesian optimization to select the number and locations of knots. On three benchmark datasets, this method is competitive with simultaneous optimization of the knots at a fraction of the computational cost.
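
A minimal sketch of the one-at-a-time idea, assuming a caller-supplied objective (e.g., a variational lower bound evaluated at a candidate knot set); exhaustive scoring of a candidate grid below stands in for the Bayesian-optimization search over knot locations used in the paper, and the stopping rule is an illustrative choice.

```python
import numpy as np

def greedy_knot_selection(candidates, objective, max_knots, tol=1e-3):
    """Add knots one at a time, keeping each new knot only if it improves
    `objective` (larger is better, e.g., a variational bound) by > tol."""
    knots, best = [], -np.inf
    for _ in range(max_knots):
        scores = [(objective(knots + [c]), c) for c in candidates
                  if c not in knots]
        if not scores:
            break
        score, c = max(scores)
        if score - best <= tol:       # no worthwhile improvement: stop,
            break                     # which also selects the knot count
        knots.append(c)
        best = score
    return knots, best
```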

Citations: 0
An algorithm to compare two-dimensional footwear outsole images using maximum cliques and speeded-up robust feature.
IF 1.3 · CAS Tier 4 (Mathematics) · Q2 Mathematics · Pub Date: 2020-04-01 · Epub Date: 2020-02-21 · DOI: 10.1002/sam.11449
Soyoung Park, Alicia Carriquiry

Footwear examiners are tasked with comparing an outsole impression (Q) left at a crime scene with an impression (K) from a database or from the suspect's shoe. We propose a method for comparing two shoe outsole impressions that relies on robust features (speeded-up robust features; SURF) on each impression and aligns them using a maximum clique (MC). After alignment, an algorithm we denote MC-COMP is used to extract additional features that are then combined into a univariate similarity score using a random forest (RF). We use a database of shoe outsole impressions that includes images from two models of athletic shoes that were purchased new and then worn by study participants for about 6 months. The shoes share class characteristics such as outsole pattern and size, and thus the comparison is challenging. We find that the RF implemented on SURF outperforms other methods recently proposed in the literature in terms of classification precision. In more realistic scenarios where crime scene impressions may be degraded and smudged, the algorithm we propose, denoted MC-COMP-SURF, shows the best classification performance by detecting unique features better than other methods. The algorithm can be implemented with the R package shoeprintr.
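
A sketch of the match-then-maximum-clique alignment idea, not the shoeprintr implementation: ORB stands in for SURF (SURF sits in OpenCV's non-free xfeatures2d module and may be absent from stock builds), networkx supplies an approximate maximum clique, and the distance tolerance is illustrative.

```python
import cv2
import networkx as nx
from networkx.algorithms import approximation as approx

def aligning_clique(img_q, img_k, eps=3.0):
    """Match keypoints between impressions Q and K, then keep the largest
    geometrically consistent subset of matches via a maximum clique."""
    orb = cv2.ORB_create(500)
    kp1, d1 = orb.detectAndCompute(img_q, None)
    kp2, d2 = orb.detectAndCompute(img_k, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)

    # Two matches are compatible if they preserve pairwise distances,
    # i.e., they could belong to a single rigid alignment of the outsoles.
    G = nx.Graph()
    G.add_nodes_from(range(len(matches)))
    for i in range(len(matches)):
        for j in range(i + 1, len(matches)):
            p1, p2 = kp1[matches[i].queryIdx].pt, kp1[matches[j].queryIdx].pt
            q1, q2 = kp2[matches[i].trainIdx].pt, kp2[matches[j].trainIdx].pt
            d_p = ((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2) ** 0.5
            d_q = ((q1[0] - q2[0]) ** 2 + (q1[1] - q2[1]) ** 2) ** 0.5
            if abs(d_p - d_q) < eps:
                G.add_edge(i, j)
    best = approx.max_clique(G)          # approximate maximum clique
    return [matches[i] for i in sorted(best)]
```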

Citations: 0
Practical Bayesian Modeling and Inference for Massive Spatial Datasets On Modest Computing Environments.
IF 1.3 · CAS Tier 4 (Mathematics) · Q2 Mathematics · Pub Date: 2019-06-01 · DOI: 10.1002/sam.11413
Lu Zhang, Abhirup Datta, Sudipto Banerjee

With continued advances in Geographic Information Systems and related computational technologies, statisticians are often required to analyze very large spatial datasets. This has generated substantial interest over the last decade, already too vast to summarize here, in scalable methodologies for analyzing large spatial datasets. Scalable spatial process models are especially attractive due to their richness and flexibility and, particularly in the Bayesian paradigm, due to their presence in hierarchical model settings. However, the vast majority of research articles in this domain have been geared toward innovative theory or more complex model development. Very limited attention has been accorded to easily implementable scalable hierarchical models for the practicing scientist or spatial analyst. This article devises massively scalable Bayesian approaches that can rapidly deliver inference on spatial processes that is practically indistinguishable from inference obtained using more expensive alternatives. A key emphasis is on implementation within very standard (modest) computing environments (e.g., a standard desktop or laptop) using easily available statistical software packages. Key insights are offered regarding assumptions and approximations concerning practical efficiency.
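
For context, the generic model such approaches target is the standard Bayesian geostatistical hierarchy

$$ y(s) = x(s)^{\top}\beta + w(s) + \epsilon(s), \qquad w(\cdot) \sim \mathrm{GP}\big(0, K_{\theta}(\cdot, \cdot)\big), \qquad \epsilon(s) \overset{iid}{\sim} N(0, \tau^2), $$

with priors on the regression coefficients beta, the covariance parameters theta, and the nugget variance tau^2. The abstract does not spell out the article's specific scalable approximation of this hierarchy, so only the generic form is shown here.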

Citations: 28
Fused Lasso Regression for Identifying Differential Correlations in Brain Connectome Graphs.
IF 1.3 · CAS Tier 4 (Mathematics) · Q2 Mathematics · Pub Date: 2018-10-01 · Epub Date: 2018-07-11 · DOI: 10.1002/sam.11382
Donghyeon Yu, Sang Han Lee, Johan Lim, Guanghua Xiao, R Cameron Craddock, Bharat B Biswal

In this paper, we propose a procedure to find differential edges between two graphs from high-dimensional data. We estimate two matrices of partial correlations and their differences by solving a penalized regression problem. We assume sparsity only on the differences between the two graphs, not on the graphs themselves. Thus, in the penalized regression problem we impose an ℓ2 penalty on the partial correlations and an ℓ1 penalty on their differences. We apply the proposed procedure to finding differential functional connectivity between healthy individuals and Alzheimer's disease patients.
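
Schematically, and with the exact regression loss deferred to the paper, the penalty structure described above corresponds to an objective of the form

$$ \min_{\rho^{(1)},\, \rho^{(2)}} \; \sum_{g=1}^{2} L\big(\rho^{(g)}\big) + \lambda_2 \sum_{g=1}^{2} \big\lVert \rho^{(g)} \big\rVert_2^2 + \lambda_1 \big\lVert \rho^{(1)} - \rho^{(2)} \big\rVert_1, $$

where rho^(g) collects the partial correlations of group g and L is the regression-based loss. The ℓ1 term acts only on the difference, so edges are shrunk toward agreement across groups and sparsity is induced exactly where the two connectomes disagree.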

Citations: 3