Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00021
Title: Deal Closure Prediction based on User's Browsing Behaviour of Sales Content
Diana Nurbakova, Timothée Saumet
We present PrediTilk, a data-driven win prediction service based on a user's browsing of sales content, such as quotes, competitor comparisons or product sheets. It is part of our GDPR-compliant system of electronic document tracking designed for marketing and sales, and addresses the win prediction problem (also known as deal closure prediction), which consists in estimating the probability that a given opportunity closes, i.e., that the prospect becomes a customer. Given information about a user's consultation of documents collected by our tracking system, our service predicts the win probability of the corresponding opportunity using machine learning models. Our evaluation shows that PrediTilk provides accurate predictions while being based purely on automatically collected data about the user's browsing behaviour. Moreover, it can provide objective signals to a CRM system, where most prospect data are entered manually. The combination of such sources can become a highly valuable asset for win prediction.
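The abstract does not disclose PrediTilk's feature set or model choice; as a minimal, hypothetical sketch of the general idea, win probability can be estimated by any probabilistic classifier over per-opportunity aggregates of browsing events. The feature names and synthetic data below are illustrative assumptions, not the actual pipeline.

```python
# Hypothetical sketch: predict deal closure from aggregated browsing features.
# Feature names and synthetic data are illustrative, not the paper's pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
# Per-opportunity features derived from tracked document consultations (assumed).
X = np.column_stack([
    rng.poisson(5, n),          # number of document views
    rng.exponential(120, n),    # total reading time (seconds)
    rng.integers(0, 4, n),      # distinct content types viewed (quote, comparison, ...)
    rng.integers(0, 2, n),      # quote opened at least once
])
# Synthetic label loosely correlated with engagement.
y = (0.3 * X[:, 0] + 0.01 * X[:, 1] + X[:, 3] + rng.normal(0, 1, n) > 3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("win-probability AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```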
{"title":"Deal Closure Prediction based on User's Browsing Behaviour of Sales Content","authors":"Diana Nurbakova, Timothée Saumet","doi":"10.1109/ICDMW51313.2020.00021","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00021","url":null,"abstract":"We present PrediTilk, a data-driven win prediction service based on user's browsing of sales content, such as quotes, competitor comparisons or product sheets. It makes part of our GDPR-compliant system of electronic document tracking designed for marketing and sales, and addresses win prediction problem (also known as deal closure prediction). The latter consists in estimating the probability of a given opportunity to close, becoming a customer. Given the information about user's consultation of documents issued from our tracking system, our service predicts win probability of this opportunity using machine learning models. Our evaluation shows that PrediTilk provides accurate predictions, while being purely based on automatically collected data about user's browsing behaviour. Besides, it can provide objective signals to a CRM system, where most of the prospects data are entered manually. The combination of such sources can become a highly valuable asset for win prediction.","PeriodicalId":426846,"journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115363613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00091
Title: Human-in-the-loop Language-agnostic Extraction of Medication Data from Highly Unstructured Electronic Health Records
Frank Ruis, Shreyasi Pathak, Jeroen Geerdink, J. H. Hegeman, C. Seifert, M. V. Keulen
Electronic health records contain important information written in free-form text. They are often highly unstructured and ungrammatical and contain misspellings and abbreviations, making it difficult to apply traditional natural language processing techniques. Annotated data is hard to come by due to restricted access, and supervised models often do not generalize well to other datasets. We propose a language-agnostic human-in-the-loop approach for extracting medication names from a large set of highly unstructured electronic health records, reaching almost 97% recall on our test set after the second iteration while maintaining 100% precision. Starting with a bootstrap lexicon, we perform a context-based dictionary expansion curated by a human reviewer. The method can handle ambiguous lexicon entries and efficiently find fuzzy matches without producing false positives. The human review step ensures high precision, which is especially important in healthcare, and is not subject to disagreements with annotations from an external source. The code is available online at https://github.com/FrankRuis/medical_concept_extraction.
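As a toy illustration of the two ingredients named above, fuzzy matching against a bootstrap lexicon and collecting context words as expansion candidates for a human reviewer, the sketch below uses a tiny lexicon and synthetic notes; it is not the authors' implementation (see their repository for that).

```python
# Minimal sketch (not the paper's implementation): fuzzy-match tokens against a
# bootstrap medication lexicon and collect surrounding words as expansion
# candidates for a human reviewer. Lexicon and notes are illustrative.
import re
from difflib import SequenceMatcher

lexicon = {"paracetamol", "ibuprofen", "metformin"}
notes = [
    "pt on paracetamol 500mg and metformine",   # note the misspelling
    "started ibuprofen, continue omeprazol",
]

def fuzzy_hit(token, lexicon, threshold=0.85):
    """Return the best lexicon entry whose similarity exceeds the threshold."""
    best, best_score = None, threshold
    for entry in lexicon:
        score = SequenceMatcher(None, token, entry).ratio()
        if score >= best_score:
            best, best_score = entry, score
    return best

candidates = set()   # context words proposed for human curation
for note in notes:
    tokens = re.findall(r"[a-z]+", note.lower())
    for i, tok in enumerate(tokens):
        if fuzzy_hit(tok, lexicon):
            # words adjacent to a known medication become candidate new entries
            candidates.update(tokens[max(0, i - 1):i + 2])

print("candidates for human review:", candidates - lexicon)
```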
{"title":"Human-in-the-loop Language-agnostic Extraction of Medication Data from Highly Unstructured Electronic Health Records","authors":"Frank Ruis, Shreyasi Pathak, Jeroen Geerdink, J. H. Hegeman, C. Seifert, M. V. Keulen","doi":"10.1109/ICDMW51313.2020.00091","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00091","url":null,"abstract":"Electronic health records contain important information written in free-form text. They are often highly unstructured and ungrammatical and contain misspellings and abbreviations, making it difficult to apply traditional natural language processing techniques. Annotated data is hard to come by due to restricted access, and supervised models often don't generalize well to other datasets. We propose a language-agnostic human-in-the-loop approach for extracting medication names from a large set of highly unstructured electronic health records, where we reach almost 97% recall on our test set after the second iteration while maintaining 100% precision. Starting with a bootstrap lexicon we perform a context based dictionary expansion curated by a human reviewer. The method can handle ambiguous lexicon entries and efficiently find fuzzy matches without producing false positives. The human review step ensures a high precision, which is especially important in healthcare, and is not subject to disagreements with annotations from an external source. The code is available online 11https://github.com/FrankRuis/medical_concept_extraction.","PeriodicalId":426846,"journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130103608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00048
Title: Individualized Context-Aware Tensor Factorization for Online Games Predictions
Julie Jiang, Kristina Lerman, Emilio Ferrara
Individual behavior and decisions are substantially influenced by context, such as location, environment, and time. Changes along these dimensions can be readily observed in Multiplayer Online Battle Arena (MOBA) games, where players face different in-game settings for each match and are subject to frequent game patches. Existing methods utilizing contextual information generalize the effect of a context over the entire population, but contextual information tailored to each individual can be more effective. To achieve this, we present the Neural Individualized Context-aware Embeddings (NICE) model for predicting user performance and game outcomes. Our proposed method identifies individual behavioral differences in different contexts by learning latent representations of users and contexts through non-negative tensor factorization. Using a dataset from the MOBA game League of Legends, we demonstrate that our model substantially improves the prediction of winning outcomes, individual user performance, and user engagement.
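A hedged sketch of the underlying building block, non-negative CP-style factorization of a user x context x statistic tensor, is given below; the dimensions, synthetic data and projected-gradient updates are illustrative assumptions and do not reproduce the NICE architecture.

```python
# Illustrative sketch (not the NICE model itself): factorize a small
# user x context x performance-statistic tensor with non-negative factors via
# projected gradient descent. Dimensions and data are synthetic.
import numpy as np

rng = np.random.default_rng(1)
n_users, n_contexts, n_stats, rank = 20, 6, 4, 3

# Synthetic non-negative tensor of per-user, per-context performance statistics.
T = rng.random((n_users, n_contexts, n_stats))

U = rng.random((n_users, rank))      # latent user factors
C = rng.random((n_contexts, rank))   # latent context factors
S = rng.random((n_stats, rank))      # latent statistic factors

def reconstruct(U, C, S):
    # CP reconstruction: T_hat[i,j,k] = sum_r U[i,r] * C[j,r] * S[k,r]
    return np.einsum("ir,jr,kr->ijk", U, C, S)

lr = 0.01
for step in range(1000):
    R = reconstruct(U, C, S) - T                 # residual tensor
    gU = np.einsum("ijk,jr,kr->ir", R, C, S)     # gradient w.r.t. U
    gC = np.einsum("ijk,ir,kr->jr", R, U, S)
    gS = np.einsum("ijk,ir,jr->kr", R, U, C)
    U = np.clip(U - lr * gU, 0, None)            # project onto non-negativity
    C = np.clip(C - lr * gC, 0, None)
    S = np.clip(S - lr * gS, 0, None)

print("final reconstruction error:", np.linalg.norm(reconstruct(U, C, S) - T))
```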
{"title":"Individualized Context-Aware Tensor Factorization for Online Games Predictions","authors":"Julie Jiang, Kristina Lerman, Emilio Ferrara","doi":"10.1109/ICDMW51313.2020.00048","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00048","url":null,"abstract":"Individual behavior and decisions are substantially influenced by their contexts, such as location, environment, and time. Changes along these dimensions can be readily observed in Multiplayer Online Battle Arena games (MOBA), where players face different in-game settings for each match and are subject to frequent game patches. Existing methods utilizing contextual information generalize the effect of a context over the entire population, but contextual information tailored to each individual can be more effective. To achieve this, we present the Neural Individualized Context-aware Embeddings (NICE) model for predicting user performance and game outcomes. Our proposed method identifies individual behavioral differences in different contexts by learning latent representations of users and contexts through non-negative tensor factorization. Using a dataset from the MOBA game League of Legends, we demonstrate that our model substantially improves the prediction of winning outcome, individual user performance, and user engagement.","PeriodicalId":426846,"journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122400386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00031
Title: A Recommender Algorithm: Gradient Recurrent Neural Network Applied to Yang-Baxter-Like Equation
Ying Liufu, Long Jin, Mei Liu, Shuai Li
In this article, a recommender algorithm based on a gradient recurrent neural network (GRNN) model is introduced. Since numerous practical problems, such as those arising in recommender systems or multi-agent systems, can be recast as matrix equation problems, the GRNN model plays an increasingly critical and promising role. The GRNN model, designed with the aid of a square-norm-based energy function, is well suited to recommender systems and is shown to be highly efficient at solving convex linear and nonlinear optimization problems. Moreover, through theoretical analysis and numerical simulation, the exponential and stable convergence of the GRNN model is validated. With its aid, a nontrivial theoretical solution of the Yang-Baxter-like matrix equation $XAX=AXA$ can be obtained.
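A minimal sketch of the gradient-dynamics idea is shown below, assuming the square-norm energy $E(X)=\tfrac{1}{2}\lVert XAX-AXA\rVert_F^2$ and a simple Euler discretization of gradient flow; the exact GRNN design in the paper may differ.

```python
# Sketch of the gradient-dynamics idea (assumed form, not the paper's exact model):
# minimize E(X) = 0.5 * ||X A X - A X A||_F^2 by Euler-discretized gradient flow,
# yielding an approximate solution of the Yang-Baxter-like equation XAX = AXA.
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A)                       # scale A for numerical stability
# X = A solves XAX = AXA exactly; start from a perturbation of it and let the
# discretized gradient dynamics settle onto a nearby solution.
X = A + 0.2 * rng.standard_normal((n, n))

eta = 0.05                                   # step size (Euler discretization)
for _ in range(5000):
    F = X @ A @ X - A @ X @ A                # residual of the matrix equation
    # gradient of E(X) = 0.5 * ||F||_F^2 with respect to X
    grad = F @ X.T @ A.T + A.T @ X.T @ F - A.T @ F @ A.T
    X -= eta * grad

print("residual ||XAX - AXA||_F:", np.linalg.norm(X @ A @ X - A @ X @ A))
```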
{"title":"A Recommender Algorithm: Gradient Recurrent Neural Network Applied to Yang-Baxter-Like Equation","authors":"Ying Liufu, Long Jin, Mei Liu, Shuai Li","doi":"10.1109/ICDMW51313.2020.00031","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00031","url":null,"abstract":"In this article, a traditional recommender algorithm termed gradient recurrent neural network (GRNN) model is introduced. Allowing for numerous practical problems such as the problems related to recommender systems or multi-agent systems that can be turned into matrix equation problems to resolve, the GRNN model becomes a more critical and promising role. The GRNN model, designed with the assistance of a square-norm-based energy function, is quite applicable to a recommender system and substantiated to be high-efficient in solving convex optimization linear or nonlinear problems. Simultaneously, implementing elaborately a theoretical analysis and numerical experiment computational simulation, the inherent exponential and stable convergence of the GRNN model is validated. With the aid of it, a theoretical nontrivial solution of the Yang-Baxter-like matrix equation $XAX=AXA$ can be obtained successfully.","PeriodicalId":426846,"journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127802264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00099
Title: Precipitation Nowcasting Using Grid-based Data in South Korea Region
Changhwan Kim, Seyoung Yun
Recently, precipitation nowcasting has gained significant attention. For instance, the demand for precise precipitation nowcasting is increasing significantly in South Korea, where frequent and unexpected heavy rainfall has recently caused severe economic damage. In this paper, we propose a U-Net based deep learning model that takes the prediction of a numerical model and corrects it, thereby improving the accuracy of the final prediction. We use two data sets: reanalysis data and LDAPS (Local Data Assimilation and Prediction System) prediction data. Both are grid-based data sets covering the whole South Korea region. We first experiment with reanalysis data to verify that our U-Net model can find atmospheric dynamics patterns, even though the input is not image data. Next, we apply the U-Net model to LDAPS prediction data; because this data is itself a prediction, we essentially perform a correction task on it. To this end, a learnable layer is added at the front of the U-Net model and concatenated with the input batch to learn location-specific information. The experiments show that the U-Net based model can find patterns using reanalysis data and has the potential to improve the accuracy of LDAPS prediction data. We also find that the learnable layer enhances test accuracy.
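The sketch below illustrates only the learnable location layer concatenated with the input of a small encoder/decoder; the channel counts, grid size and network depth are assumptions chosen for brevity, not the paper's configuration.

```python
# Toy sketch (assumed architecture details, not the paper's exact model): a tiny
# U-Net-style encoder/decoder whose input is concatenated with a learnable
# per-grid-cell "location" map, mirroring the idea of learning location-specific
# information at the front of the network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCorrectionNet(nn.Module):
    def __init__(self, in_ch=4, loc_ch=2, h=32, w=32):
        super().__init__()
        # learnable location-specific map, broadcast over the batch dimension
        self.loc = nn.Parameter(torch.zeros(1, loc_ch, h, w))
        c = in_ch + loc_ch
        self.enc1 = nn.Conv2d(c, 16, 3, padding=1)
        self.enc2 = nn.Conv2d(16, 32, 3, padding=1)
        self.dec1 = nn.Conv2d(32 + 16, 16, 3, padding=1)   # skip connection
        self.out = nn.Conv2d(16, 1, 1)                      # corrected precipitation

    def forward(self, x):
        x = torch.cat([x, self.loc.expand(x.size(0), -1, -1, -1)], dim=1)
        e1 = F.relu(self.enc1(x))
        e2 = F.relu(self.enc2(F.max_pool2d(e1, 2)))
        u = F.interpolate(e2, scale_factor=2, mode="nearest")
        d1 = F.relu(self.dec1(torch.cat([u, e1], dim=1)))
        return self.out(d1)

model = TinyCorrectionNet()
lda_forecast = torch.randn(8, 4, 32, 32)     # e.g. numerical forecast fields on a 32x32 grid
corrected = model(lda_forecast)
print(corrected.shape)                        # torch.Size([8, 1, 32, 32])
```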
{"title":"Precipitation Nowcasting Using Grid-based Data in South Korea Region","authors":"Changhwan Kim, Seyoung Yun","doi":"10.1109/ICDMW51313.2020.00099","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00099","url":null,"abstract":"Recently, precipitation nowcasting has gained significant attention. For instance, the demand for precise precipitation nowcasting is significantly increasing in South Korea since the economic damage has been severe in recent days because of frequent and unexpected heavy rainfall. In this paper, we propose a U-Net based deep learning model that predicts from a numerical model and then corrects the data using the U-Net based deep learning model so that it can improve the accuracy of the final prediction. We use two data sets: reanalysis data and LDAPS(Local Data Assimilation and Prediction System) prediction data. Both data sets are grid-based data that covers the whole South Korea region. We first experiment with reanalysis data to identify that our U-Net model can find atmospheric dynamics patterns, even if it is not image data. Next, we use LDAPS prediction data and apply it to the U-Net model. Because LDAPS prediction data is also a prediction, we essentially conduct correcting task for this data. To this aim, a learnable layer is added at the front of the U-Net model and concatenated with the input batch to learn location-specific information. The experiment shows that the U-Net based model can find patterns using reanalysis data. Further, it has the potential to improve the accuracy of LDAPS prediction data. We also find that the learnable layer enhances test accuracy.","PeriodicalId":426846,"journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130386452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00124
Title: Multi-class imbalanced semi-supervised learning from streams through online ensembles
P. Vafaie, H. Viktor, W. Michalowski
Multi-class imbalance, in which the rates of instances in the various classes differ substantially, poses a major challenge when learning from evolving streams. In this setting, minority class instances may arrive infrequently and in bursts, making accurate model construction problematic. Further, skewed streams are not only susceptible to concept drift, but class labels may also be absent, expensive to obtain, or only arrive after some delay. The combined effects of multi-class skew, concept drift and semi-supervised learning have received limited attention in the online learning community. In this paper, we introduce a multi-class online ensemble algorithm that is suitable for learning in such settings. Specifically, our algorithm uses sampling with replacement while dynamically increasing the weights of underrepresented classes based on recall, in order to produce models that benefit all classes. Our approach addresses the potential lack of labels by incorporating a self-training semi-supervised learning method for labeling instances. Our experimental results show that our online ensemble performs well on multi-class imbalanced data containing concept drift. In addition, our algorithm produces accurate predictions even in the presence of unlabeled data.
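A minimal sketch of the general mechanism, online bagging with Poisson replication whose rate is boosted for classes with low running recall, is given below; the base learner, synthetic stream and replication schedule are illustrative assumptions, and the self-training component is omitted.

```python
# Minimal sketch of the general idea (not the authors' exact algorithm): an online
# bagging ensemble where each example is replicated Poisson(lambda) times and
# lambda is increased for classes whose running recall is low. Uses a synthetic
# imbalanced stream and scikit-learn's partial_fit interface.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(3)
classes = np.array([0, 1, 2])
ensemble = [SGDClassifier(random_state=i) for i in range(5)]
seen = np.zeros(3)      # instances seen per class
correct = np.zeros(3)   # correctly predicted per class (prequential)

def make_instance():
    # imbalanced stream: class 2 is rare
    y = rng.choice(classes, p=[0.70, 0.25, 0.05])
    x = rng.normal(loc=y * 2.0, scale=1.0, size=4)   # class-dependent mean
    return x.reshape(1, -1), y

for t in range(3000):
    x, y = make_instance()
    # prequential test-then-train
    votes = [m.predict(x)[0] for m in ensemble if hasattr(m, "coef_")]
    if votes:
        pred = np.bincount(votes, minlength=3).argmax()
        correct[y] += int(pred == y)
    seen[y] += 1
    recall = correct[y] / seen[y]
    lam = 1.0 + (1.0 - recall)        # boost under-recalled (minority) classes
    for m in ensemble:
        for _ in range(rng.poisson(lam)):
            m.partial_fit(x, [y], classes=classes)

print("per-class recall estimate:", np.round(correct / np.maximum(seen, 1), 3))
```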
{"title":"Multi-class imbalanced semi-supervised learning from streams through online ensembles","authors":"P. Vafaie, H. Viktor, W. Michalowski","doi":"10.1109/ICDMW51313.2020.00124","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00124","url":null,"abstract":"Multi-class imbalance, in which the rates of instances in the various classes differ substantially, poses a major challenge when learning from evolving streams. In this setting, minority class instances may arrive infrequently and in bursts, making accurate model construction problematic. Further, skewed streams are not only susceptible to concept drifts, but class labels may also be absent, expensive to obtain, or only arrive after some delay. The combined effects of multi-class skew, concept drift and semi-supervised learning have received limited attention in the online learning community. In this paper, we introduce a multi-class online ensemble algorithm that is suitable for learning in such settings. Specifically, our algorithm uses sampling with replacement while dynamically increasing the weights of underrepresented classes based on recall in order to produce models that benefit all classes. Our approach addresses the potential lack of labels by incorporating a self-training semi-supervised learning method for labeling instances. Our experimental results show that our online ensemble performs well against multi-class imbalanced data containing concept drifts. In addition, our algorithm produces accurate predictions, even in the presence of unlabeled data.","PeriodicalId":426846,"journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131841475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00044
Title: Uncertain Time Series Classification with Shapelet Transform
Michael Franklin Mbouopda, E. Nguifo
Time series classification is a task that aims at classifying chronological data. It is used in a diverse range of domains such as meteorology, medicine and physics. In the last decade, many algorithms have been built to perform this task with very appreciable accuracy. However, applications where time series carry uncertainty have been under-explored. Using uncertainty propagation techniques, we propose a new uncertain dissimilarity measure based on Euclidean distance. We then propose the uncertain shapelet transform algorithm for the classification of uncertain time series. The extensive experiments we conducted on state-of-the-art datasets show the effectiveness of our contribution. The source code of our contribution and the datasets we used are all available in a public repository.
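As a generic illustration of propagating per-point uncertainty through a Euclidean dissimilarity, the sketch below uses standard first-order propagation; the paper's exact measure may be defined differently.

```python
# Generic sketch of uncertainty propagation through a Euclidean dissimilarity
# (illustrative formulas; the paper's exact measure may differ). Each series value
# x_i carries an uncertainty dx_i; first-order propagation gives an uncertainty
# on the distance itself.
import numpy as np

def uncertain_euclidean(x, dx, y, dy):
    """Return (distance, propagated uncertainty) between two uncertain series."""
    d = x - y
    dd = np.sqrt(dx**2 + dy**2)           # uncertainty of each difference (independent errors)
    s = np.sum(d**2)                      # squared distance
    ds = np.sum(2.0 * np.abs(d) * dd)     # first-order uncertainty of s
    dist = np.sqrt(s)
    unc = ds / (2.0 * dist) if dist > 0 else 0.0
    return dist, unc

x = np.array([1.0, 2.0, 3.0, 2.5])
y = np.array([1.1, 1.8, 3.3, 2.0])
dx = np.full(4, 0.05)                     # per-point measurement uncertainty
dy = np.full(4, 0.10)
print(uncertain_euclidean(x, dx, y, dy))
```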
{"title":"Uncertain Time Series Classification with Shapelet Transform","authors":"Michael Franklin Mbouopda, E. Nguifo","doi":"10.1109/ICDMW51313.2020.00044","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00044","url":null,"abstract":"Time series classification is a task that aims at classifying chronological data. It is used in a diverse range of domains such as meteorology, medicine and physics. In the last decade, many algorithms have been built to perform this task with very appreciable accuracy. However, applications where time series have uncertainty has been under-explored. Using uncertainty propagation techniques, we propose a new uncertain dissimilarity measure based on Euclidean distance. We then propose the uncertain shapelet transform algorithm for the classification of uncertain time series. The large experiments we conducted on state of the art datasets show the effectiveness of our contribution. The source code of our contribution and the datasets we used are all available on a public repository.","PeriodicalId":426846,"journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128985651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00088
Title: Graph-based Topic Extraction Using Centroid Distance of Phrase Embeddings on Healthy Aging Open-ended Survey Questions
D. Kosmajac, Kirstie Smith, Vlado Keselj, S. Kirkland
Open-ended questions are a very important part of research surveys. However, they pose a processing challenge, since manual processing requires labour-intensive human effort. Automating the task requires NLP methods, since free text does not guarantee a standardized structure. To tackle this problem, we present a solution for topic discovery and analysis of open-ended survey items. We use a graph-based representation of the text that adds structure and enables easier manipulation and keyphrase retrieval. Additionally, we use pre-trained fastText aligned word vectors to cluster similar phrases even if they are written in different languages. The goal is to produce topic word and phrase representatives that are easy for a domain expert to interpret. We compare the method with traditional LDA and two state-of-the-art algorithms, BTM and WNTM. The resulting keyphrases representing topics are more intuitive to the domain experts than those obtained by the reference topic models in similar experimental settings.
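The sketch below illustrates only the centroid-distance clustering step, using random vectors as stand-ins for the pre-trained fastText aligned phrase embeddings; the threshold and greedy assignment are illustrative assumptions, not the full pipeline.

```python
# Toy sketch of the centroid-distance idea (not the paper's full pipeline): phrases
# are embedded (here, random stand-ins for fastText aligned vectors) and greedily
# grouped into a cluster whenever they fall within a distance threshold of an
# existing cluster centroid; otherwise they start a new cluster.
import numpy as np

rng = np.random.default_rng(4)
phrases = ["stay active", "rester actif", "healthy diet", "eat vegetables", "social contact"]
# stand-in embeddings; in practice these would be averaged aligned word vectors
emb = {p: rng.normal(size=50) for p in phrases}

def assign_clusters(phrases, emb, threshold=9.0):
    clusters = []                             # each cluster: {"members": [...], "centroid": vec}
    for p in phrases:
        v = emb[p]
        dists = [np.linalg.norm(v - c["centroid"]) for c in clusters]
        if dists and min(dists) < threshold:
            c = clusters[int(np.argmin(dists))]
            c["members"].append(p)
            c["centroid"] = np.mean([emb[m] for m in c["members"]], axis=0)
        else:
            clusters.append({"members": [p], "centroid": v.copy()})
    return clusters

for c in assign_clusters(phrases, emb):
    print(c["members"])
```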
{"title":"Graph-based Topic Extraction Using Centroid Distance of Phrase Embeddings on Healthy Aging Open-ended Survey Questions","authors":"D. Kosmajac, Kirstie Smith, Vlado Keselj, S. Kirkland","doi":"10.1109/ICDMW51313.2020.00088","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00088","url":null,"abstract":"Open-ended questions are a very important part of research surveys. However, they can pose a challenge when it comes to processing since manual processing requires a labour-intensive human effort. Automation of the task requires application of NLP methods since free text does not ensure standardized structure. To tackle this problem, we present a solution for topic discovery and analysis of open-ended survey items. We use graph-based representation of the text that adds structure and enables easier manipulation and keyphrase retrieval. Additionally, we use pre-trained fastText aligned word vectors to cluster similar phrases even if they are written in different languages. The goal is to produce topic word and phrase representatives that are easy to interpret by a domain expert. We compare the method with traditional LDA and two state-of-the-art algorithms: BTM and WNTM. The resulting keyphrases representing topics are more intuitive to the domain experts than the ones obtained by reference topic models in similar experimental settings.","PeriodicalId":426846,"journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122394878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00072
Title: Linking Personally Identifiable Information from the Dark Web to the Surface Web: A Deep Entity Resolution Approach
Fangyu Lin, Yizhi Liu, Mohammadreza Ebrahimi, Zara Ahmad-Post, J. Hu, Jingyu Xin, S. Samtani, Weifeng Li, Hsinchun Chen
The information privacy of Internet users has become a major societal concern. The rapid growth of online services increases the risk of unauthorized access to Personally Identifiable Information (PII) of at-risk populations, who are unaware of their PII exposure. To proactively identify online at-risk populations and increase their privacy awareness, it is crucial to conduct a holistic privacy risk assessment across the internet. Current privacy risk assessment studies are limited to a single platform within either the surface web or the dark web. A comprehensive privacy risk assessment requires matching exposed PII on heterogeneous online platforms across the surface web and the dark web. However, due to the incompleteness and inaccuracy of PII records in each platform, linking the exposed PII to users is a non-trivial task. While Entity Resolution (ER) techniques can be used to facilitate this task, they often require ad hoc, manual rule development and feature engineering. Recently, Deep Learning (DL)-based ER has outperformed manual entity matching rules by automatically extracting prominent features from incomplete or inaccurate records. In this study, we enhance existing privacy risk assessment with a DL-based ER method, namely Multi-Context Attention (MCA), to comprehensively evaluate individuals' PII exposure across different online platforms in the dark web and surface web. Evaluation against benchmark ER models indicates the efficacy of MCA. Using MCA on a random sample of data breach victims in the dark web, we are able to identify 4.3% of the victims on surface web platforms and calculate their privacy risk scores.
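A heavily simplified sketch of attention-based pairwise record matching is given below; the trigram hashing, field alignment and classifier head are assumptions chosen for brevity and do not reproduce the Multi-Context Attention architecture.

```python
# Highly simplified sketch of attention-based entity resolution (assumed design,
# not the paper's Multi-Context Attention model): records are lists of field
# strings, fields are embedded from hashed character trigrams, and a soft
# attention alignment between the two records feeds a match classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, BUCKETS = 32, 1000

def field_embedding(text, table):
    """Average embedding of hashed character trigrams of one field."""
    trigrams = [text[i:i + 3] for i in range(max(1, len(text) - 2))]
    idx = torch.tensor([hash(t) % BUCKETS for t in trigrams])
    return table(idx).mean(dim=0)

class PairMatcher(nn.Module):
    def __init__(self):
        super().__init__()
        self.table = nn.Embedding(BUCKETS, DIM)
        self.clf = nn.Sequential(nn.Linear(2 * DIM, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, rec_a, rec_b):
        A = torch.stack([field_embedding(f, self.table) for f in rec_a])  # (Fa, DIM)
        B = torch.stack([field_embedding(f, self.table) for f in rec_b])  # (Fb, DIM)
        attn = F.softmax(A @ B.T, dim=1)          # align each field of A to fields of B
        aligned = attn @ B                         # (Fa, DIM)
        summary = torch.cat([(A - aligned).abs().mean(0), (A * aligned).mean(0)])
        return torch.sigmoid(self.clf(summary))   # match probability

matcher = PairMatcher()
p = matcher(["john doe", "jdoe@mail.com"], ["doe, john", "j.doe@mail.com"])
print(float(p))   # untrained score; training would use labeled record pairs
```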
{"title":"Linking Personally Identifiable Information from the Dark Web to the Surface Web: A Deep Entity Resolution Approach","authors":"Fangyu Lin, Yizhi Liu, Mohammadreza Ebrahimi, Zara Ahmad-Post, J. Hu, Jingyu Xin, S. Samtani, Weifeng Li, Hsinchun Chen","doi":"10.1109/ICDMW51313.2020.00072","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00072","url":null,"abstract":"The information privacy of the Internet users has become a major societal concern. The rapid growth of online services increases the risk of unauthorized access to Personally Identifiable Information (PII) of at-risk populations, who are unaware of their PII exposure. To proactively identify online at-risk populations and increase their privacy awareness, it is crucial to conduct a holistic privacy risk assessment across the internet. Current privacy risk assessment studies are limited to a single platform within either the surface web or the dark web. A comprehensive privacy risk assessment requires matching exposed PII on heterogeneous online platforms across the surface web and the dark web. However, due to the incompleteness and inaccuracy of PII records in each platform, linking the exposed PII to users is a non-trivial task. While Entity Resolution (ER) techniques can be used to facilitate this task, they often require ad-hoc, manual rule development and feature engineering. Recently, Deep Learning (DL)-based ER has outperformed manual entity matching rules by automatically extracting prominent features from incomplete or inaccurate records. In this study, we enhance the existing privacy risk assessment with a DL-based ER method, namely Multi-Context Attention (MCA), to comprehensively evaluate individuals' PII exposure across the different online platforms in the dark web and surface web. Evaluation against benchmark ER models indicates the efficacy of MCA. Using MCA on a random sample of data breach victims in the dark web, we are able to identify 4.3% of the victims on the surface web platforms and calculate their privacy risk scores.","PeriodicalId":426846,"journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116406384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00081
Title: Partially Shared Semi-supervised Deep Matrix Factorization with Multi-view Data
Haonan Huang, Naiyao Liang, Wei Yan, Zuyuan Yang, Weijun Sun
Since many real-world data can be described from multiple views, multi-view learning has attracted considerable attention. Various methods, typically based on matrix factorization models, have been proposed and successfully applied to multi-view learning. Recently, such models have been extended to deep structures to exploit the hierarchical information of multi-view data, but view-specific features and label information are seldom considered. To address these concerns, we present a partially shared semi-supervised deep matrix factorization model (PSDMF). By integrating a partially shared deep decomposition structure, graph regularization and a semi-supervised regression model, PSDMF can learn a compact and discriminative representation by eliminating the effects of uncorrelated information. In addition, we develop an efficient iterative updating algorithm for PSDMF. Extensive experiments on five benchmark datasets demonstrate that PSDMF achieves better performance than state-of-the-art multi-view learning approaches. The MATLAB source code is available at https://github.com/libertyhhn/PartiallySharedDMF.
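A bare-bones, single-layer sketch of the "partially shared" factorization idea is given below, with a shared block Hs common to both views and view-specific blocks H1, H2; the deep structure, graph regularization and semi-supervised regression terms of PSDMF are omitted, and the notation is assumed for illustration.

```python
# Bare-bones sketch of "partially shared" factorization (single layer, assumed
# notation): each view X_v is approximated by W_v @ [Hs; H_v], where Hs is common
# to both views and H_v is view-specific. PSDMF's deep structure, graph and
# semi-supervised terms are omitted here.
import numpy as np

rng = np.random.default_rng(5)
n, d1, d2, k_sh, k_sp = 100, 30, 20, 5, 3

X1 = np.abs(rng.normal(size=(d1, n)))        # view 1 (features x samples)
X2 = np.abs(rng.normal(size=(d2, n)))        # view 2

W1 = np.abs(rng.normal(size=(d1, k_sh + k_sp)))
W2 = np.abs(rng.normal(size=(d2, k_sh + k_sp)))
Hs = np.abs(rng.normal(size=(k_sh, n)))      # shared representation
H1 = np.abs(rng.normal(size=(k_sp, n)))      # view-specific representations
H2 = np.abs(rng.normal(size=(k_sp, n)))

lr = 1e-3
for _ in range(2000):
    F1 = np.vstack([Hs, H1])
    F2 = np.vstack([Hs, H2])
    R1 = W1 @ F1 - X1                        # residuals
    R2 = W2 @ F2 - X2
    gW1, gW2 = R1 @ F1.T, R2 @ F2.T
    gF1, gF2 = W1.T @ R1, W2.T @ R2
    # shared block receives gradient from both views
    Hs = np.clip(Hs - lr * (gF1[:k_sh] + gF2[:k_sh]), 0, None)
    H1 = np.clip(H1 - lr * gF1[k_sh:], 0, None)
    H2 = np.clip(H2 - lr * gF2[k_sh:], 0, None)
    W1 = np.clip(W1 - lr * gW1, 0, None)
    W2 = np.clip(W2 - lr * gW2, 0, None)

err = np.linalg.norm(W1 @ np.vstack([Hs, H1]) - X1) + np.linalg.norm(W2 @ np.vstack([Hs, H2]) - X2)
print("total reconstruction error:", round(err, 3))
```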
{"title":"Partially Shared Semi-supervised Deep Matrix Factorization with Multi-view Data","authors":"Haonan Huang, Naiyao Liang, Wei Yan, Zuyuan Yang, Weijun Sun","doi":"10.1109/ICDMW51313.2020.00081","DOIUrl":"https://doi.org/10.1109/ICDMW51313.2020.00081","url":null,"abstract":"Since many real-world data can be described from multiple views, multi-view learning has attracted considerable attention. Various methods have been proposed and successfully applied to multi-view learning, typically based on matrix factorization models. Recently, it is extended to the deep structure to exploit the hierarchical information of multi-view data, but the view-specific features and the label information are seldom considered. To address these concerns, we present a partially shared semi-supervised deep matrix factorization model (PSDMF). By integrating the partially shared deep decomposition structure, graph regularization and the semi-supervised regression model, PSDMF can learn a compact and discriminative representation through eliminating the effects of uncorrelated information. In addition, we develop an efficient iterative updating algorithm for PSDMF. Extensive experiments on five benchmark datasets demonstrate that PSDMF can achieve better performance than the state-of-the-art multi-view learning approaches. The MATLAB source code is available at https://github.com/libertyhhn/PartiallySharedDMF.","PeriodicalId":426846,"journal":{"name":"2020 International Conference on Data Mining Workshops (ICDMW)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115365130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}