Record linkage (RL) is the process of identifying records in different databases that refer to the same entity. The attribute values of records that belong to the same entity commonly evolve over time; for example, people can change their surname or address. Therefore, to identify the records that refer to the same entity over time, RL should make use of temporal information such as the time-stamp of when a record was created and/or last updated. However, when RL is conducted on information about people, privacy and confidentiality concerns mean that organizations are often not willing or allowed to share sensitive data in their databases, such as personal medical records or location and financial details, with other organizations. This paper is the first to propose a privacy-preserving temporal record linkage (PPTRL) protocol that can link records across different databases while ensuring the privacy of the sensitive data in these databases. We propose a novel protocol based on Bloom filter encoding which incorporates the temporal information available in records during the linkage process. Our approach uses homomorphic encryption to securely calculate the probabilities of entities changing attribute values in their records over a period of time. Based on these probabilities we generate a set of masking Bloom filters to adjust the similarities between record pairs. We provide a theoretical analysis of the complexity and privacy of our technique and conduct an empirical study on large real databases containing several million records. The experimental results show that our approach can achieve better linkage quality than non-temporal privacy-preserving record linkage (PPRL) while providing privacy to individuals in the databases that are being linked.
{"title":"Privacy-Preserving Temporal Record Linkage","authors":"Thilina Ranbaduge, P. Christen","doi":"10.1109/ICDM.2018.00053","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00053","journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)"},"publicationDate":"2018-11-01"}
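The Bloom filter encoding that the abstract builds on can be illustrated with a minimal sketch. It shows only the generic, widely used PPRL building block: hashing character q-grams into a bit array and comparing two encodings with the Dice coefficient. It does not reproduce the paper's temporal masking filters or its homomorphic encryption step. The function names, the filter length of 256 bits, the four hash functions, and the use of seeded SHA-256 are all assumptions chosen for illustration.

```python
import hashlib


def qgrams(value, q=2):
    """Split a string into its set of overlapping character q-grams."""
    value = value.lower()
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def bloom_encode(value, num_bits=256, num_hashes=4, q=2):
    """Return the set of bit positions set to 1 for the value's q-grams.

    Assumption: each of the num_hashes hash functions is SHA-256 with a
    different integer seed prefixed to the q-gram.
    """
    bits = set()
    for gram in qgrams(value, q):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).hexdigest()
            bits.add(int(digest, 16) % num_bits)
    return bits


def dice_similarity(bits_a, bits_b):
    """Dice coefficient of two Bloom filters given as sets of 1-bit positions."""
    total = len(bits_a) + len(bits_b)
    if total == 0:
        return 0.0
    return 2.0 * len(bits_a & bits_b) / total


if __name__ == "__main__":
    # A surname that changed slightly still shares q-grams with the old value.
    print(dice_similarity(bloom_encode("smith"), bloom_encode("smyth")))
```

Because similar strings share q-grams, their filters share set bits, so parties can compare encodings without exchanging the plain-text values.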
Fraud detection is of great importance because fraudulent behaviors may mislead consumers or bring huge losses to enterprises. Due to the lockstep feature of fraudulent behaviors, the fraud detection problem can be viewed as finding suspicious dense blocks in an attributed bipartite graph. In reality, existing attribute-based methods are not adversarially robust, because fraudsters can take camouflage actions to make their behavior attributes appear normal. More importantly, existing methods based on structural information consider only shallow topological structure, making their effectiveness sensitive to the density of suspicious blocks. In this paper, we propose a novel deep structure learning model named DeepFD to differentiate normal users from suspicious users. DeepFD can preserve the non-linear graph structure and user behavior information simultaneously. Experimental results on different types of datasets demonstrate that DeepFD outperforms the state-of-the-art baselines.
{"title":"Deep Structure Learning for Fraud Detection","authors":"Haibo Wang, Chuan Zhou, Jia Wu, Weizhen Dang, Xingquan Zhu, Jilong Wang","doi":"10.1109/ICDM.2018.00072","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00072","journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)"},"publicationDate":"2018-11-01"}
Huifang Ma, Di Zhang, Weizhong Zhao, Yanru Wang, Zhongzhi Shi
Recommending valuable content to microblog users is an important way to improve their experience. As high-quality descriptors of user semantics, tags have long been used to represent users' interests or attributes. In this work, we propose a microblog recommendation approach based on hypergraph random walk tag expansion and user social relations. More specifically, for each user, microblogs are treated as hyperedges and terms as hypervertices, and weighting strategies are established for both. A random walk is performed on the weighted hypergraph to obtain a number of terms as tags for users. The tag similarity matrix and the user-tag matrix are then constructed from the tag probability correlations and the weight of each tag. The significance of user social relations is also considered for recommendation. Moreover, an iterative updating scheme is developed to obtain the final user-tag matrix, which is used to compute the similarities between microblogs and users. Experimental results show that the algorithm is effective in microblog recommendation.
{"title":"Leveraging Hypergraph Random Walk Tag Expansion and User Social Relation for Microblog Recommendation","authors":"Huifang Ma, Di Zhang, Weizhong Zhao, Yanru Wang, Zhongzhi Shi","doi":"10.1109/ICDM.2018.00152","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00152","journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)"},"publicationDate":"2018-11-01"}
Keyphrases have been widely used in large document collections to provide a concise summary of document content. While significant effort has been devoted to automatic keyphrase extraction, existing methods struggle to train a robust supervised model when labeled data are insufficient, as in resource-poor domains. To this end, we propose a novel Topic-based Adversarial Neural Network (TANN) method, which aims to exploit the unlabeled data in the target domain and the data in the resource-rich source domain. Specifically, we first explicitly incorporate global topic information into the document representation using a topic correlation layer. Then, domain-invariant features are learned by adversarial training on the topic-based representation, allowing efficient transfer from the source domain to the target domain. Meanwhile, to balance the adversarial training and preserve the domain-private features of the target domain, we reconstruct the target data from both the forward and backward directions. Finally, based on the learned features, keyphrases are extracted using a tagging method. Experiments on two real-world cross-domain scenarios demonstrate that our method can significantly improve the performance of keyphrase extraction on unlabeled or insufficiently labeled target domains.
{"title":"Exploiting Topic-Based Adversarial Neural Network for Cross-Domain Keyphrase Extraction","authors":"Yanan Wang, Qi Liu, Chuan Qin, Tong Xu, Yijun Wang, Enhong Chen, Hui Xiong","doi":"10.1109/ICDM.2018.00075","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00075","journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)"},"publicationDate":"2018-11-01"}
Jiawei Chen, C. Wang, M. Ester, Qihao Shi, Yan Feng, Chun Chen
With the explosive growth of online social networks, many social recommendation methods have been proposed, and they have demonstrated that social information has the potential to improve recommendation performance. However, existing social recommendation methods always assume that the data is missing at random (MAR), which is rarely the case. In fact, by analysing two real-world social recommendation datasets, we observed the following interesting phenomena: (1) users tend to consume and rate the items that they like and the items that have been consumed by their friends; (2) when an item has been consumed by more friends, the average value of the observed ratings becomes smaller, not larger as assumed by the existing models. To model these phenomena, we integrate the missing not at random (MNAR) assumption into social recommendation and propose a new social recommendation method, SPMF-MNAR, which models the observation process of rating data based on users' preferences and social influence. Extensive experiments conducted on large real-world datasets validate that SPMF-MNAR achieves better performance than existing social recommendation methods and than non-social methods based on the MNAR assumption.
{"title":"Social Recommendation with Missing Not at Random Data","authors":"Jiawei Chen, C. Wang, M. Ester, Qihao Shi, Yan Feng, Chun Chen","doi":"10.1109/ICDM.2018.00018","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00018","journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)"},"publicationDate":"2018-11-01"}
Discovering community structure in networks remains a fundamentally challenging task. From scientific domains such as biology, chemistry and physics to social networks, identifying community structure in different kinds of networks is difficult because there is no universal definition of community structure. Furthermore, with the surge of social networks, content information has come to play a pivotal role in defining community structure, demanding techniques beyond the traditional approaches. Recently, network representation learning has shown tremendous promise. Leveraging recent advances in deep learning, one can bring deep learning's strengths to bear on network problems. Most prominently, supervised and semi-supervised approaches have shown promising results in network representation learning tasks such as link prediction and graph classification. However, much remains to be explored for community detection, which is an unsupervised learning task. This paper proposes a deep generative model for community detection and network generation. Empowered by Bayesian deep learning, deep generative models are capable of exploiting non-linearities while giving insights in terms of uncertainty. Hence, this paper proposes the Variational Graph Autoencoder for Community Detection (VGAECD). Extensive experiments show that it is capable of outperforming existing state-of-the-art methods. The generality of the proposed model also allows it to be used as a graph generator. Additionally, unlike traditional methods, the proposed model does not require a predefined definition of community structure. Instead, it assumes the existence of latent similarity between nodes and finds these similarities through an automatic model selection process. Optionally, it can exploit feature-rich information of a network, such as node content, further increasing its performance.
{"title":"Learning Community Structure with Variational Autoencoder","authors":"Jun Jin Choong, Xin Liu, T. Murata","doi":"10.1109/ICDM.2018.00022","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00022","journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)"},"publicationDate":"2018-11-01"}
Rosni Lumbantoruan, Xiangmin Zhou, Yongli Ren, Z. Bao
Context-aware recommendation has emerged as perhaps the most popular service on online sites, and has seen applications in domains as diverse as entertainment, e-business, e-health and government services. There has been significant recent progress on the quality and scalability of recommender systems. However, we believe that different target users care about different contexts when they select an online item, which can greatly affect the quality of recommendation and has not yet been investigated. In this paper, we propose a new type of recommender system, the Declarative Context-Aware Recommender System (D-CARS), which personalizes the contexts exploited for each target user by automatically analysing the viewing history of users. First, we propose a novel User-Window Non-negative Matrix Factorization topic model (UW-NMF) that adaptively identifies the significant contexts of users and constructs user profiles in a personalized manner. Then, we design a novel declarative context-aware recommendation algorithm that exploits the user's context preference to identify a group of item candidates and its context distribution, based on a Subspace Ensemble Tree Model (SETM) constructed in the identified context subspace for item recommendation. Finally, we propose an algorithm that incrementally maintains the SETM. Extensive experiments are conducted to demonstrate the effectiveness and efficiency of our D-CARS system.
{"title":"D-CARS: A Declarative Context-Aware Recommender System","authors":"Rosni Lumbantoruan, Xiangmin Zhou, Yongli Ren, Z. Bao","doi":"10.1109/ICDM.2018.00151","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00151","journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)"},"publicationDate":"2018-11-01"}
Haripriya Harikumar, Santu Rana, Sunil Gupta, Thin Nguyen, R. Kaimal, S. Venkatesh
Privacy preservation is important. Prescriptive analytics is a method for extracting corrective actions that avoid undesirable outcomes. We propose a privacy-preserving prescriptive analytics algorithm that protects the data used to construct the prescriptive model. We use a differential privacy mechanism to achieve a strong privacy guarantee. The mechanism requires computing the sensitivity: the maximum change in the output between two training datasets that differ in only one instance. The main challenge we address is computing the sensitivity of the prescription vector. In the absence of an analytical form, we construct a nested global optimization problem to compute the sensitivity. We solve the optimization problem using constrained Bayesian optimization, as the nested structure makes the objective function expensive to evaluate. We demonstrate our algorithm on two real-world datasets and observe that the prescription vectors remain useful even after being made private.
{"title":"Differentially Private Prescriptive Analytics","authors":"Haripriya Harikumar, Santu Rana, Sunil Gupta, Thin Nguyen, R. Kaimal, S. Venkatesh","doi":"10.1109/ICDM.2018.00124","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00124","journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)"},"publicationDate":"2018-11-01"}
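The role that sensitivity plays in a differential privacy mechanism can be sketched with the standard Laplace mechanism, in which noise with scale sensitivity/epsilon is added to each output coordinate. This is a generic illustration, not the authors' algorithm: the abstract does not name the noise distribution, and the sketch assumes the sensitivity is already known and measured in the L1 norm, whereas the paper's contribution is estimating it with nested Bayesian optimization. The names `laplace_noise` and `privatize` are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(vector, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to every coordinate.

    Assumption: `sensitivity` is the L1 sensitivity of the vector-valued
    output and has been computed elsewhere.
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [x + laplace_noise(scale, rng) for x in vector]


if __name__ == "__main__":
    prescription = [0.8, -1.2, 0.1]  # made-up prescription vector
    print(privatize(prescription, sensitivity=0.5, epsilon=1.0, rng=random.Random(7)))
```

A smaller epsilon or a larger sensitivity increases the noise scale, which is why a tight sensitivity estimate matters for keeping the released vector useful.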
From personalized ad delivery and healthcare to criminal sentencing, more decisions than ever before are made with help from methods developed in the fields of data mining and machine learning. However, their widespread use has raised concerns about the discriminatory impact these methods may have on the people subject to these decisions. Recently, imbalance in the misclassification rates between groups has been identified as a source of discrimination. Such discrimination is not handled by most existing work in discrimination-aware data mining, and it can persist even if other types of discrimination are alleviated. In this article, we present the Balancing Terms (BT) method to address this problem. BT balances the error rates of any classifier with a differentiable prediction function, and unlike existing work, it can incorporate a preference for the trade-off between fairness and accuracy. We empirically evaluate BT on real-world data, demonstrating that our method produces trade-offs between error rate balance and total classification error that are superior to the state of the art in most cases and comparable to it in only a few.
{"title":"Using Balancing Terms to Avoid Discrimination in Classification","authors":"Simon Enni, I. Assent","doi":"10.1109/ICDM.2018.00116","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00116","journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)"},"publicationDate":"2018-11-01"}
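The imbalance in misclassification rates that the abstract describes can be made concrete with a small sketch that measures it. The code computes false positive and false negative rates per group and reports the largest gap between groups. It only quantifies the problem; it is not the BT method, which adds balancing terms to the training objective of a differentiable classifier. The function names and the toy labels are assumptions for illustration.

```python
def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for binary labels."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


def error_rate_gaps(y_true, y_pred, groups):
    """Return the largest between-group gap in FPR and in FNR."""
    by_group = {}
    for t, p, g in zip(y_true, y_pred, groups):
        truths, preds = by_group.setdefault(g, ([], []))
        truths.append(t)
        preds.append(p)
    rates = [error_rates(ts, ps) for ts, ps in by_group.values()]
    fprs = [r[0] for r in rates]
    fnrs = [r[1] for r in rates]
    return max(fprs) - min(fprs), max(fnrs) - min(fnrs)


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 0, 0, 1, 1]
    y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
    print(error_rate_gaps(y_true, y_pred, groups))
```

In the toy data both groups have the same overall error rate, yet group "a" receives only false positives and group "b" only false negatives, which is the kind of imbalance that overall accuracy hides.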
Jinzheng Tu, Guoxian Yu, C. Domeniconi, J. Wang, Guoqiang Xiao, Maozu Guo
Crowdsourcing is a useful and economical approach to data annotation. To obtain high-quality annotations, various aggregation approaches have been developed, which take into account different factors that affect the quality of the aggregated answers. However, existing methods generally focus on single-label (multi-class and binary) tasks and ignore the inter-correlation between labels, which may compromise their quality. In this paper, we introduce a Multi-Label answer aggregation approach based on Joint Matrix Factorization (ML-JMF). ML-JMF selectively and jointly factorizes the sample-label association matrices collected from different annotators into products of individual and shared low-rank matrices. As such, it takes advantage of the robustness of low-rank matrix approximation to noise, and reduces the impact of unreliable annotators by assigning small (or zero) weights to their annotation matrices. In addition, it takes advantage of the correlation among labels by leveraging the shared low-rank matrix, and of the similarity between annotators by using the individual low-rank matrices to guide the factorization. ML-JMF pursues the low-rank matrices via a unified objective function, and introduces an iterative technique to optimize it. Finally, ML-JMF uses the optimized low-rank matrices and weights to infer the ground-truth labels. Our experimental results on multi-label datasets show that ML-JMF outperforms competitive methods in inferring ground-truth labels. Our approach can identify unreliable annotators, and is robust against their misleading answers through the assignment of small (or zero) weights to their annotations.
"Multi-label Answer Aggregation Based on Joint Matrix Factorization." Jinzheng Tu, Guoxian Yu, C. Domeniconi, J. Wang, Guoqiang Xiao, Maozu Guo. 2018 IEEE International Conference on Data Mining (ICDM), 2018-11-01. DOI: 10.1109/ICDM.2018.00067 (https://doi.org/10.1109/ICDM.2018.00067)
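The abstract above describes factorizing each annotator's sample-label matrix into an individual factor and a shared factor, with per-annotator weights. It does not state the objective or the update rules, so the following is only a rough sketch under stated assumptions and not the ML-JMF algorithm itself: it uses plain gradient descent on a weighted squared reconstruction error, and a hypothetical reweighting rule (weights proportional to 1 / (1 + error)) chosen purely for illustration. The names `joint_factorize` and `aggregate` are made up for this example.

```python
import random


def matmul_t(U, V):
    """Return U @ V^T for list-of-lists matrices U (n x k) and V (q x k)."""
    return [[sum(u * v for u, v in zip(ur, vr)) for vr in V] for ur in U]


def joint_factorize(mats, k=2, steps=300, lr=0.05, seed=0):
    """Approximate each annotator matrix A_w (n x q) as U_w @ V^T.

    U_w is individual to annotator w; V (q x k) is shared by all annotators.
    Assumed reweighting rule: weight_w is proportional to 1 / (1 + error_w).
    """
    rng = random.Random(seed)
    n, q, m = len(mats[0]), len(mats[0][0]), len(mats)
    V = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(q)]
    Us = [[[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n)]
          for _ in range(m)]
    weights = [1.0 / m] * m
    for _ in range(steps):
        grad_V = [[0.0] * k for _ in range(q)]
        errors = []
        for w in range(m):
            A, U = mats[w], Us[w]
            R = matmul_t(U, V)
            # Residual of the current reconstruction for annotator w.
            E = [[R[i][j] - A[i][j] for j in range(q)] for i in range(n)]
            errors.append(sum(e * e for row in E for e in row))
            # Accumulate the weighted gradient for the shared factor: E^T @ U.
            for j in range(q):
                for c in range(k):
                    grad_V[j][c] += weights[w] * sum(
                        E[i][j] * U[i][c] for i in range(n))
            # Gradient step on the individual factor: U -= lr * weight * (E @ V).
            Us[w] = [[U[i][c] - lr * weights[w] * sum(
                          E[i][j] * V[j][c] for j in range(q))
                      for c in range(k)] for i in range(n)]
        V = [[V[j][c] - lr * grad_V[j][c] for c in range(k)] for j in range(q)]
        # Annotators that fit the shared structure poorly get smaller weights.
        inv = [1.0 / (1.0 + e) for e in errors]
        total = sum(inv)
        weights = [x / total for x in inv]
    return Us, V, weights


def aggregate(Us, V, weights, threshold=0.5):
    """Threshold the weighted average of the per-annotator reconstructions."""
    n, q = len(Us[0]), len(V)
    recon = [matmul_t(U, V) for U in Us]
    return [[1 if sum(wt * R[i][j] for wt, R in zip(weights, recon)) >= threshold
             else 0 for j in range(q)] for i in range(n)]
```

A call such as `joint_factorize([A1, A2, A3])` followed by `aggregate(Us, V, weights)` would yield one binary label matrix for all samples. Whether a given unreliable annotator actually ends up with a small weight depends on the data, the rank `k`, and the assumed reweighting rule.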