"Semantically-Aware Statistical Metrics via Weighting Kernels" (DOI: 10.1109/DSAA.2019.00019)
S. Cresci, R. D. Pietro, M. Tesconi
Distance metrics between statistical distributions are widely used as an efficient means to aggregate and simplify the underlying probabilities, thus enabling high-level analyses. In this paper we investigate the collisions that can arise with such metrics, and a mitigation technique rooted in kernels. In detail, we first show that the existence of colliding functions (so-called iso-curves) is widespread across metrics and families of functions (e.g., Gaussians, heavy-tailed distributions). We then propose a kernel-based solution for augmenting distance metrics and summary statistics, thus avoiding collisions and highlighting semantically relevant phenomena. This study is supported by a thorough theoretical evaluation of our solution against a large number of functions and metrics, complemented by a real-world evaluation carried out by applying our solution to an existing problem. Some further research avenues are also discussed. The theoretical construction and the achieved results show the soundness, viability, and quality of our proposal, which, besides being interesting in its own right, also paves the way for further research in the highlighted directions.
"Explaining the Performance of Black Box Regression Models" (DOI: 10.1109/DSAA.2019.00025)
Inês Areosa, L. Torgo
The widespread use of Machine Learning and Data Mining models in several key areas of our societies has raised serious concerns about accountability and the ability to justify and interpret the decisions of these models. This is even more relevant when models are too complex and often regarded as black boxes. In this paper we present several tools designed to help understand and explain the reasons for the observed predictive performance of black box regression models. We describe, evaluate, and propose several variants of Error Dependence Plots. These plots provide a visual display of the expected relationship between the prediction error of any model and the values of a predictor variable. They allow the end user to understand what to expect from the models given some concrete values of the predictor variables, and they enable more accurate explanations of the conditions that may lead to model failures. Moreover, our proposed extensions also provide a multivariate perspective on this analysis and the ability to compare the behaviour of multiple models under different conditions. This comparative analysis empowers the end user with a case-based analysis of the risks associated with different models, making it possible to select the model with the lowest expected risk for each test case, or even to decide not to use any model because the expected error is unacceptable.
"Exploring the Relationship Between Conversation Using #MeToo and University Harassment Policies" (DOI: 10.1109/DSAA.2019.00083)
Julianne Zech, F. Dale, L. Singh, Jamillah Williams, Naomi Mezey
While identifying those who are most vocal in social media movements can be straightforward, finding hidden groups can be challenging. This poster presents a case study focused on the relationship between mentions of universities in the #MeToo Twitter conversation and the policies universities have implemented with regard to harassment and assault. Preliminary results suggest that there is variation in policies, resources, and responses to sexual misconduct across campuses, and that there is also variation in the number of mentions of different universities. However, there is no clear relationship between policies and online discussion involving universities.
"Bighead: A Framework-Agnostic, End-to-End Machine Learning Platform" (DOI: 10.1109/DSAA.2019.00070)
E. Brumbaugh, Atul S. Kale, Alfredo Luque, Bahador B. Nooraei, John Park, Krishna P. N. Puttaswamy, Kyle Schiller, E. Shapiro, Conglei Shi, Aaron Siegel, N. Simha, Mani Bhushan, Marie Sbrocca, Shi-Jing Yao, P. Yoon, Varant Zanoyan, Xiao-Han T. Zeng, Qiang Zhu, Andrew Cheong, Michelle Du, Jeff Feng, N. Handel, Andrew Hoh, J. Hone, Brad Hunter
With the increasing need to build systems and products powered by machine learning inside organizations, it is critical to have a platform that provides machine learning practitioners with a unified environment to easily prototype, deploy, and maintain their models at scale. However, due to the diversity of machine learning libraries, inconsistencies between environments, and varied scalability requirements, no existing work to date addresses all of these challenges. Here, we introduce Bighead, a framework-agnostic, end-to-end platform for machine learning. It offers a seamless user experience requiring only minimal effort across feature set management, prototyping, training, batch (offline) inference, real-time (online) inference, evaluation, and model lifecycle management. In contrast to existing platforms, it is designed to be highly versatile and extensible, and supports all major machine learning frameworks rather than focusing on one particular framework. It ensures consistency across different environments and stages of the model lifecycle, as well as across data sources and transformations. It scales horizontally and elastically with the workload, for example with dataset size and throughput. Its components include a feature management framework, a model development toolkit, a lifecycle management service with a UI, an offline training and inference engine, an online inference service, an interactive prototyping environment, and a Docker image customization tool. It is the first platform to offer a feature management component that is a general-purpose aggregation framework with a lambda architecture and temporal joins. Bighead is deployed and widely adopted at Airbnb, and has enabled the data science and engineering teams to develop and deploy machine learning models in a timely and reliable manner. Bighead has shortened the time to deploy a new model from months to days, ensured the stability of models in production, facilitated the adoption of cutting-edge models, and enabled advanced machine-learning-based product features on the Airbnb platform. We present two use cases of productionizing computer vision and natural language processing models.
{"title":"Message from the General/Logistics Chairs","authors":"","doi":"10.1109/dsaa.2019.00005","DOIUrl":"https://doi.org/10.1109/dsaa.2019.00005","url":null,"abstract":"","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124703643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"VizCertify: A Framework for Secure Visual Data Exploration" (DOI: 10.1109/DSAA.2019.00039)
L. Stefani, Leonhard F. Spiegelberg, E. Upfal, Tim Kraska
Recently, there have been several proposals to develop visual recommendation systems. The most advanced systems aim to recommend visualizations that help users find new correlations or identify interesting deviations based on the current context of the user's analysis. However, when recommending a visualization to a user, there is an inherent risk of visualizing random fluctuations rather than true patterns: a problem largely ignored by current techniques. In this paper, we present VizCertify, a novel framework that improves visual recommendation systems by quantifying the statistical significance of recommended visualizations. The proposed methodology makes it possible to control the probability of misleading visual recommendations using both classical statistical testing procedures and a novel application of the Vapnik-Chervonenkis (VC) dimension to visualization recommendation, which yields an effective criterion for deciding whether a recommendation corresponds to a true phenomenon.
"A Rapid Prototyping Approach for High Performance Density-Based Clustering" (DOI: 10.1109/DSAA.2019.00041)
Saiyedul Islam, S. Balasubramaniam, Poonam Goyal, Ankit Sultana, Lakshit Bhutani, S. Raje, Navneet Goyal
Big Data has significantly increased the dependence of the data analytics community on High Performance Computing (HPC) systems. However, efficiently programming an HPC system is still a tedious task requiring specialized skills in parallelization and the use of platform-specific languages and mechanisms. We present a framework for quickly prototyping new and existing density-based clustering algorithms while obtaining low running times and high speedups via automatic parallelization. The user is required only to specify the sequential algorithm in a Domain Specific Language (DSL) for clustering at a very high level of abstraction. The parallelizing compiler for the DSL does the rest to leverage distributed systems, in particular typical scale-out clusters made of commodity hardware. Our approach is based on recurring, parallelizable programming patterns known as Kernels, which are identified and parallelized by the compiler. We demonstrate the ease of programming and scalable performance for the DBSCAN, SNN, and RECOME algorithms. We also establish that the proposed approach achieves performance comparable to state-of-the-art manually parallelized implementations while requiring minimal programming effort, several orders of magnitude less than that required on other parallel platforms such as MPI or Spark.
"Improving the Personalized Recommendation in the Cold-start Scenarios" (DOI: 10.1109/DSAA.2019.00079)
Péter Gáspár, Michal Kompan, Matej Koncal, M. Bieliková
Recommender systems generate items that should be interesting to customers. However, recommenders usually fail in the cold-start scenario, when a new item or a new customer appears. In our work, we study the cold-start problem for a new customer: for a cold-start customer we find the most similar existing customers and use their pre-trained collaborative filtering models to recommend. We compare several recommendation approaches and similarity metrics, analyzing both accuracy and computational performance.
"Universal Consistency of Support Tensor Machine" (DOI: 10.1109/DSAA.2019.00080)
Peide Li, T. Maiti
The tensor (multidimensional array) classification problem has become popular in modern applications such as computer vision and spatial-temporal data analysis. The Support Tensor Machine (STM) classifier, an extension of the support vector machine, takes tensor-type data as predictors to predict the labels of the data. The distribution-free property of STM highlights its potential for handling different types of data applications. In this work, we provide a theoretical result on the universal consistency of STM. This result guarantees the solid generalization ability of STM with universal tensor-based kernel functions. In addition, we give a way of constructing universal kernel functions for tensor data, which may be helpful for other types of tensor-based kernel methods.
"A Novel Record Linkage Interface That Incorporates Group Structure to Rapidly Collect Richer Labels" (DOI: 10.1109/DSAA.2019.00073)
K. Frisoli, Benjamin LeRoy, Rebecca Nugent
Linking historical data longitudinally allows researchers to better characterize topics like population mobility, the impact of local/national events, and generational changes. The ideal linking process would involve subject matter experts with detailed information about each record, including any relationships to other records; however, this in-depth process is expensive and often infeasible. Record linkage is the process of identifying and labeling records corresponding to unique entities. Statistical record linkage models largely rely on pairwise comparisons, under-utilizing information about group structure and historical knowledge. Moreover, model performance can be limited by using labels of unknown certainty or origin. In record linkage, we are rarely given information about the number of labelers, how often they agreed, or the labeling process itself. Understanding how and why records are linked together, for the dual purposes of gaining insight into the human decision-making process and improving record linkage models, is an exciting, high-impact area of research. We present an interactive labeling interface for use at the initial stages of the (potentially crowdsourced) record linkage process. The interface captures labeled records while tracking labeler actions. It allows labelers to view and interact with records at both the individual and group level, thereby providing nested labels. We simultaneously receive information about label certainty and the labeler's decision-making process via repeated label instances and click-streams. We demonstrate the utility of this interface on the recently released, unlabeled 1901 and 1911 Ireland Census records and discuss the benefits of richer labels.