"Semantically-Aware Statistical Metrics via Weighting Kernels" (DOI: 10.1109/DSAA.2019.00019)
S. Cresci, R. D. Pietro, M. Tesconi
Distance metrics between statistical distributions are widely used as an efficient means to aggregate and simplify the underlying probabilities, thus enabling high-level analyses. In this paper we investigate the collisions that can arise with such metrics, and a mitigation technique rooted in kernels. In detail, we first show that the existence of colliding functions (so-called iso-curves) is widespread across metrics and families of functions (e.g., Gaussians, heavy-tailed distributions). We then propose a kernel-based solution for augmenting distance metrics and summary statistics, thus avoiding collisions and highlighting semantically relevant phenomena. This study is supported by a thorough theoretical evaluation of our solution against a large number of functions and metrics, complemented by a real-world evaluation carried out by applying our solution to an existing problem. Some further research avenues are also discussed. The theoretical construction and the achieved results show the soundness, viability, and quality of our proposal, which, besides being interesting in its own right, also paves the way for further research in the highlighted directions.
"Explaining the Performance of Black Box Regression Models" (DOI: 10.1109/DSAA.2019.00025)
Inês Areosa, L. Torgo
The widespread use of Machine Learning and Data Mining models in several key areas of our societies has raised serious concerns about accountability and the ability to justify and interpret the decisions of these models. This is even more relevant when models are too complex and often regarded as black boxes. In this paper we present several tools designed to help understand and explain the reasons for the observed predictive performance of black box regression models. We describe, evaluate, and propose several variants of Error Dependence Plots. These plots provide a visual display of the expected relationship between the prediction error of any model and the values of a predictor variable. They allow the end user to understand what to expect from the models given some concrete values of the predictor variables, and they enable more accurate explanations of the conditions that may lead to model failures. Moreover, our proposed extensions also provide a multivariate perspective on this analysis and the ability to compare the behaviour of multiple models under different conditions. This comparative analysis empowers the end user with a case-based analysis of the risks associated with different models, making it possible to select the model with the lowest expected risk for each test case, or even to decide not to use any model because the expected error is unacceptable.
"Exploring the Relationship Between Conversation Using #MeToo and University Harassment Policies" (DOI: 10.1109/DSAA.2019.00083)
Julianne Zech, F. Dale, L. Singh, Jamillah Williams, Naomi Mezey
While identifying those who are most vocal in social media movements can be straightforward, finding hidden groups can be challenging. This poster presents a case study focused on the relationship between mentions of universities in the #MeToo Twitter conversation and the policies universities have implemented with regard to harassment and assault. Preliminary results suggest that there is variation in policies, resources, and responses to sexual misconduct across campuses, and that there is also variation in the number of mentions of different universities. However, there is no clear relationship between policies and online discussion involving universities.
"Bighead: A Framework-Agnostic, End-to-End Machine Learning Platform" (DOI: 10.1109/DSAA.2019.00070)
E. Brumbaugh, Atul S. Kale, Alfredo Luque, Bahador B. Nooraei, John Park, Krishna P. N. Puttaswamy, Kyle Schiller, E. Shapiro, Conglei Shi, Aaron Siegel, N. Simha, Mani Bhushan, Marie Sbrocca, Shi-Jing Yao, P. Yoon, Varant Zanoyan, Xiao-Han T. Zeng, Qiang Zhu, Andrew Cheong, Michelle Du, Jeff Feng, N. Handel, Andrew Hoh, J. Hone, Brad Hunter
With the increasing need to build systems and products powered by machine learning inside organizations, it is critical to have a platform that provides machine learning practitioners with a unified environment to easily prototype, deploy, and maintain their models at scale. However, due to the diversity of machine learning libraries, inconsistencies between environments, and varied scalability requirements, no existing work to date addresses all of these challenges. Here, we introduce Bighead, a framework-agnostic, end-to-end platform for machine learning. It offers a seamless user experience requiring only minimal effort across feature set management, prototyping, training, batch (offline) inference, real-time (online) inference, evaluation, and model lifecycle management. In contrast to existing platforms, it is designed to be highly versatile and extensible, and supports all major machine learning frameworks rather than focusing on one particular framework. It ensures consistency across different environments and stages of the model lifecycle, as well as across data sources and transformations. It scales horizontally and elastically with the workload, for example with dataset size and throughput. Its components include a feature management framework, a model development toolkit, a lifecycle management service with a UI, an offline training and inference engine, an online inference service, an interactive prototyping environment, and a Docker image customization tool. It is the first platform to offer a feature management component that is a general-purpose aggregation framework with a lambda architecture and temporal joins. Bighead is deployed and widely adopted at Airbnb, and has enabled the data science and engineering teams to develop and deploy machine learning models in a timely and reliable manner. Bighead has shortened the time to deploy a new model from months to days, ensured the stability of models in production, facilitated the adoption of cutting-edge models, and enabled advanced machine-learning-based product features on the Airbnb platform. We present two use cases of productionizing computer vision and natural language processing models.
{"title":"Message from the General/Logistics Chairs","authors":"","doi":"10.1109/dsaa.2019.00005","DOIUrl":"https://doi.org/10.1109/dsaa.2019.00005","url":null,"abstract":"","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124703643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"VizCertify: A Framework for Secure Visual Data Exploration" (DOI: 10.1109/DSAA.2019.00039)
L. Stefani, Leonhard F. Spiegelberg, E. Upfal, Tim Kraska
Recently, there have been several proposals to develop visual recommendation systems. The most advanced systems aim to recommend visualizations that help users find new correlations or identify interesting deviations based on the current context of the user's analysis. However, when recommending a visualization to a user, there is an inherent risk of visualizing random fluctuations rather than true patterns: a problem largely ignored by current techniques. In this paper, we present VizCertify, a novel framework that improves visual recommendation systems by quantifying the statistical significance of recommended visualizations. The proposed methodology makes it possible to control the probability of misleading visual recommendations using both classical statistical testing procedures and a novel application of the Vapnik-Chervonenkis (VC) dimension to visualization recommendation, which yields an effective criterion for deciding whether a recommendation corresponds to a true phenomenon.
"A Rapid Prototyping Approach for High Performance Density-Based Clustering" (DOI: 10.1109/DSAA.2019.00041)
Saiyedul Islam, S. Balasubramaniam, Poonam Goyal, Ankit Sultana, Lakshit Bhutani, S. Raje, Navneet Goyal
Big Data has significantly increased the dependence of the data analytics community on High Performance Computing (HPC) systems. However, efficiently programming an HPC system is still a tedious task requiring specialized skills in parallelization and the use of platform-specific languages and mechanisms. We present a framework for quickly prototyping new and existing density-based clustering algorithms while obtaining low running times and high speedups via automatic parallelization. The user is required only to specify the sequential algorithm in a Domain Specific Language (DSL) for clustering at a very high level of abstraction. The parallelizing compiler for the DSL does the rest to leverage distributed systems, in particular typical scale-out clusters made of commodity hardware. Our approach is based on recurring, parallelizable programming patterns known as Kernels, which are identified and parallelized by the compiler. We demonstrate the ease of programming and scalable performance for the DBSCAN, SNN, and RECOME algorithms. We also establish that the proposed approach achieves performance comparable to state-of-the-art manually parallelized implementations while requiring minimal programming effort, several orders of magnitude less than that required on other parallel platforms such as MPI or Spark.
"Improving the Personalized Recommendation in the Cold-start Scenarios" (DOI: 10.1109/DSAA.2019.00079)
Péter Gáspár, Michal Kompan, Matej Koncal, M. Bieliková
Recommender systems generate items that should be interesting to customers. However, recommenders usually fail in the cold-start scenario, when a new item or a new customer appears. In our work, we study the cold-start problem for a new customer: for a cold-start customer we find the most similar existing customers and use their pre-trained collaborative filtering models to recommend. We compare several recommendation approaches and similarity metrics, analyzing both accuracy and computational performance.
"Universal Consistency of Support Tensor Machine" (DOI: 10.1109/DSAA.2019.00080)
Peide Li, T. Maiti
The tensor (multidimensional array) classification problem has become popular in modern applications such as computer vision and spatial-temporal data analysis. The Support Tensor Machine (STM) classifier, an extension of the support vector machine, takes tensor-type data as predictors to predict the labels of the data. The distribution-free property of STM highlights its potential for handling different types of data applications. In this work, we provide a theoretical result on the universal consistency of STM. This result guarantees the solid generalization ability of STM with universal tensor-based kernel functions. In addition, we give a way of constructing universal kernel functions for tensor data, which may be helpful for other types of tensor-based kernel methods.
"A Novel Record Linkage Interface That Incorporates Group Structure to Rapidly Collect Richer Labels" (DOI: 10.1109/DSAA.2019.00073)
K. Frisoli, Benjamin LeRoy, Rebecca Nugent
Linking historical data longitudinally allows researchers to better characterize topics like population mobility, the impact of local/national events, and generational changes. The ideal linking process would involve subject matter experts with detailed information about each record, including any relationships to other records; however, this in-depth process is expensive and often infeasible. Record linkage is the process of identifying and labeling records corresponding to unique entities. Statistical record linkage models largely rely on pairwise comparisons, under-utilizing information about group structure and historical knowledge. Moreover, model performance can be limited by using labels of unknown certainty or origin. In record linkage, we are rarely given information about the number of labelers, how often they agreed, or the labeling process itself. Understanding how and why records are linked together, for the dual purposes of gaining insight into the human decision-making process and improving record linkage models, is an exciting, high-impact area of research. We present an interactive labeling interface for use at the initial stages of the (potentially crowdsourced) record linkage process. The interface captures labeled records while tracking labeler actions. It allows labelers to view and interact with records at both the individual and group level, thereby providing nested labels. We simultaneously receive information about label certainty and the labeler's decision-making process via repeated label instances and click-streams. We demonstrate the utility of this interface on the recently released, unlabeled 1901 and 1911 Ireland Census records and discuss the benefits of richer labels.