
2020 International Conference on Data Mining Workshops (ICDMW): Latest Papers

Deal Closure Prediction based on User's Browsing Behaviour of Sales Content
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00021
Diana Nurbakova, Timothée Saumet
We present PrediTilk, a data-driven win prediction service based on users' browsing of sales content, such as quotes, competitor comparisons, or product sheets. It is part of our GDPR-compliant electronic document tracking system designed for marketing and sales, and addresses the win prediction problem (also known as deal closure prediction), which consists in estimating the probability that a given opportunity will close, i.e. that the prospect becomes a customer. Given information from our tracking system about a user's consultation of documents, our service predicts the win probability of the opportunity using machine learning models. Our evaluation shows that PrediTilk provides accurate predictions while relying purely on automatically collected data about users' browsing behaviour. It can also provide objective signals to a CRM system, where most prospect data are entered manually. Combining these sources can be a highly valuable asset for win prediction.
Citations: 1
Human-in-the-loop Language-agnostic Extraction of Medication Data from Highly Unstructured Electronic Health Records
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00091
Frank Ruis, Shreyasi Pathak, Jeroen Geerdink, J. H. Hegeman, C. Seifert, M. V. Keulen
Electronic health records contain important information written in free-form text. They are often highly unstructured and ungrammatical and contain misspellings and abbreviations, making it difficult to apply traditional natural language processing techniques. Annotated data is hard to come by due to restricted access, and supervised models often don't generalize well to other datasets. We propose a language-agnostic human-in-the-loop approach for extracting medication names from a large set of highly unstructured electronic health records, where we reach almost 97% recall on our test set after the second iteration while maintaining 100% precision. Starting with a bootstrap lexicon, we perform a context-based dictionary expansion curated by a human reviewer. The method can handle ambiguous lexicon entries and efficiently find fuzzy matches without producing false positives. The human review step ensures high precision, which is especially important in healthcare, and is not subject to disagreements with annotations from an external source. The code is available online at https://github.com/FrankRuis/medical_concept_extraction.
Citations: 1
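The medication-extraction abstract above describes a bootstrap lexicon that grows through fuzzy matches approved by a human reviewer. The following is only a minimal sketch of that loop, not the authors' context-based method: it uses plain string similarity from Python's `difflib`, and the function names, the `threshold` value, and the `approve` callback are all hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_candidates(tokens, lexicon, threshold=0.8):
    """Return (token, closest lexicon entry, similarity) for near misses.

    Exact lexicon hits are skipped; only tokens whose best similarity
    reaches `threshold` (an assumed cut-off) are proposed for review.
    """
    if not lexicon:
        return []
    candidates = []
    for token in tokens:
        lowered = token.lower()
        if lowered in lexicon:
            continue  # already known, nothing to review
        best = max(lexicon, key=lambda e: SequenceMatcher(None, lowered, e).ratio())
        score = SequenceMatcher(None, lowered, best).ratio()
        if score >= threshold:
            candidates.append((token, best, score))
    return candidates


def expand_lexicon(lexicon, candidates, approve):
    """Add only the candidates a human reviewer accepts.

    `approve` stands in for the manual review step: it receives the raw
    token and returns True or False.
    """
    accepted = {token.lower() for token, _, _ in candidates if approve(token)}
    return set(lexicon) | accepted
```

Because every proposed entry passes through `approve`, precision is controlled by the reviewer rather than by the similarity threshold, which is the role the abstract assigns to the human review step.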
Individualized Context-Aware Tensor Factorization for Online Games Predictions
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00048
Julie Jiang, Kristina Lerman, Emilio Ferrara
Individual behavior and decisions are substantially influenced by context, such as location, environment, and time. Changes along these dimensions can be readily observed in Multiplayer Online Battle Arena (MOBA) games, where players face different in-game settings for each match and are subject to frequent game patches. Existing methods that use contextual information generalize the effect of a context over the entire population, but contextual information tailored to each individual can be more effective. To achieve this, we present the Neural Individualized Context-aware Embeddings (NICE) model for predicting user performance and game outcomes. Our proposed method identifies individual behavioral differences in different contexts by learning latent representations of users and contexts through non-negative tensor factorization. Using a dataset from the MOBA game League of Legends, we demonstrate that our model substantially improves the prediction of winning outcome, individual user performance, and user engagement.
Citations: 1
A Recommender Algorithm: Gradient Recurrent Neural Network Applied to Yang-Baxter-Like Equation
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00031
Ying Liufu, Long Jin, Mei Liu, Shuai Li
In this article, a traditional recommender algorithm termed the gradient recurrent neural network (GRNN) model is introduced. Because many practical problems, such as those arising in recommender systems or multi-agent systems, can be recast as matrix equation problems, the GRNN model plays an increasingly important and promising role. The GRNN model, designed with the help of a square-norm-based energy function, is well suited to recommender systems and has been shown to be highly efficient in solving linear or nonlinear convex optimization problems. Theoretical analysis and numerical simulations validate the inherent exponential and stable convergence of the GRNN model. With its aid, a nontrivial theoretical solution of the Yang-Baxter-like matrix equation $XAX=AXA$ can be obtained.
Citations: 0
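The GRNN abstract above mentions a square-norm-based energy function but does not spell it out. One plausible reading, stated here as an assumption rather than as the authors' exact formulation, takes the residual of $XAX=AXA$ for real matrices, uses half its squared Frobenius norm as the energy, and lets the state follow the negative gradient with an assumed gain $\gamma$:

```latex
\begin{align}
  R(X) &= XAX - AXA, \\
  E(X) &= \tfrac{1}{2}\,\lVert R(X) \rVert_F^2, \\
  \nabla_X E &= R\,(AX)^{\top} + (XA)^{\top} R - A^{\top} R\, A^{\top}, \\
  \dot{X}(t) &= -\gamma\, \nabla_X E\bigl(X(t)\bigr), \qquad \gamma > 0 .
\end{align}
```

Under this form, $\tfrac{d}{dt}E = -\gamma\,\lVert \nabla_X E \rVert_F^2 \le 0$, so the energy never increases along a trajectory. Note that $X = 0$ and $X = A$ are trivial solutions, so the initial state matters when a nontrivial solution is wanted.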
Precipitation Nowcasting Using Grid-based Data in South Korea Region
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00099
Changhwan Kim, Seyoung Yun
Recently, precipitation nowcasting has gained significant attention. For instance, the demand for precise precipitation nowcasting is increasing significantly in South Korea, where frequent and unexpected heavy rainfall has recently caused severe economic damage. In this paper, we propose an approach that takes predictions from a numerical model and then corrects them with a U-Net-based deep learning model, so as to improve the accuracy of the final prediction. We use two data sets: reanalysis data and LDAPS (Local Data Assimilation and Prediction System) prediction data. Both data sets are grid-based and cover the whole South Korea region. We first experiment with reanalysis data to verify that our U-Net model can find atmospheric dynamics patterns even though the input is not image data. Next, we apply the U-Net model to LDAPS prediction data. Because LDAPS data is itself a prediction, this is essentially a correction task. To this end, a learnable layer is added at the front of the U-Net model and concatenated with the input batch to learn location-specific information. The experiment shows that the U-Net-based model can find patterns using reanalysis data. Further, it has the potential to improve the accuracy of LDAPS prediction data. We also find that the learnable layer enhances test accuracy.
Citations: 1
Multi-class imbalanced semi-supervised learning from streams through online ensembles
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00124
P. Vafaie, H. Viktor, W. Michalowski
Multi-class imbalance, in which the rates of instances in the various classes differ substantially, poses a major challenge when learning from evolving streams. In this setting, minority class instances may arrive infrequently and in bursts, making accurate model construction problematic. Further, skewed streams are not only susceptible to concept drifts, but class labels may also be absent, expensive to obtain, or only arrive after some delay. The combined effects of multi-class skew, concept drift and semi-supervised learning have received limited attention in the online learning community. In this paper, we introduce a multi-class online ensemble algorithm that is suitable for learning in such settings. Specifically, our algorithm uses sampling with replacement while dynamically increasing the weights of underrepresented classes based on recall in order to produce models that benefit all classes. Our approach addresses the potential lack of labels by incorporating a self-training semi-supervised learning method for labeling instances. Our experimental results show that our online ensemble performs well against multi-class imbalanced data containing concept drifts. In addition, our algorithm produces accurate predictions, even in the presence of unlabeled data.
Citations: 5
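The online-ensemble abstract above combines sampling with replacement and recall-based class weights. In online bagging, sampling with replacement is commonly approximated by presenting each incoming instance to a learner `k` times with `k` drawn from a Poisson distribution. The sketch below assumes that scheme and a hypothetical weighting rule, `base * (2 - recall)`; neither the rule nor the function names come from the paper.

```python
import math
import random


def class_rates(recall_by_class, base=1.0):
    """Map each class to a Poisson rate that grows as its recall drops.

    The rule `base * (2 - recall)` is an illustrative assumption: a class
    with perfect recall keeps rate `base`, a class with zero recall gets
    twice that.
    """
    return {label: base * (2.0 - recall) for label, recall in recall_by_class.items()}


def poisson(rate, rng):
    """Draw k ~ Poisson(rate) with Knuth's multiplication method."""
    limit = math.exp(-rate)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return k
        k += 1


def times_to_train(label, recall_by_class, rng):
    """How many times one ensemble member should train on this instance."""
    return poisson(class_rates(recall_by_class)[label], rng)
```

Each ensemble member would call `times_to_train` for every labeled instance it sees, so instances from classes with poor recall are replayed more often on average.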
Uncertain Time Series Classification with Shapelet Transform
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00044
Michael Franklin Mbouopda, E. Nguifo
Time series classification is a task that aims at classifying chronological data. It is used in a diverse range of domains such as meteorology, medicine and physics. In the last decade, many algorithms have been built to perform this task with very appreciable accuracy. However, applications where time series have uncertainty have been under-explored. Using uncertainty propagation techniques, we propose a new uncertain dissimilarity measure based on Euclidean distance. We then propose the uncertain shapelet transform algorithm for the classification of uncertain time series. The extensive experiments we conducted on state-of-the-art datasets show the effectiveness of our contribution. The source code of our contribution and the datasets we used are all available in a public repository.
Citations: 2
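The uncertain time series abstract above builds a dissimilarity measure from Euclidean distance and uncertainty propagation, without giving the formula. As a rough illustration only, the sketch below applies standard worst-case first-order propagation rules to the squared Euclidean distance; these rules, and the function name, are assumptions and may differ from the measure defined in the paper.

```python
def uncertain_sq_euclidean(x, dx, y, dy):
    """Squared Euclidean distance between two uncertain series.

    x, y   : best-guess values of the two series (same length)
    dx, dy : absolute uncertainties attached to each value

    Assumed propagation rules (worst case, first order):
      uncertainty of (a - b)  = da + db
      uncertainty of d ** 2   = 2 * |d| * dd
      uncertainty of a sum    = sum of the uncertainties
    Returns (distance, uncertainty).
    """
    if not (len(x) == len(dx) == len(y) == len(dy)):
        raise ValueError("all four sequences must have the same length")
    distance, uncertainty = 0.0, 0.0
    for xi, dxi, yi, dyi in zip(x, dx, y, dy):
        diff = xi - yi
        diff_unc = dxi + dyi
        distance += diff * diff
        uncertainty += 2.0 * abs(diff) * diff_unc
    return distance, uncertainty
```

Returning the pair rather than a single number lets a downstream step, such as shapelet matching, decide how to rank two candidates whose distances overlap once their uncertainties are taken into account.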
Graph-based Topic Extraction Using Centroid Distance of Phrase Embeddings on Healthy Aging Open-ended Survey Questions
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00088
D. Kosmajac, Kirstie Smith, Vlado Keselj, S. Kirkland
Open-ended questions are a very important part of research surveys. However, they are challenging to process, since manual processing requires labour-intensive human effort. Automating the task requires NLP methods, since free text has no guaranteed standardized structure. To tackle this problem, we present a solution for topic discovery and analysis of open-ended survey items. We use a graph-based representation of the text that adds structure and enables easier manipulation and keyphrase retrieval. Additionally, we use pre-trained fastText aligned word vectors to cluster similar phrases even if they are written in different languages. The goal is to produce topic word and phrase representatives that are easy for a domain expert to interpret. We compare the method with traditional LDA and two state-of-the-art algorithms: BTM and WNTM. The resulting keyphrases representing topics are more intuitive to the domain experts than the ones obtained by reference topic models in similar experimental settings.
Citations: 0
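The topic-extraction abstract above clusters phrases by the centroid distance of their word embeddings. A minimal sketch of that idea follows, assuming a phrase is represented by the mean of its word vectors and compared with cosine distance. The `word_vectors` mapping is a hypothetical stand-in for pre-trained aligned fastText vectors, and the function names are illustrative.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def phrase_vector(phrase, word_vectors):
    """Embed a phrase as the centroid of its known word vectors."""
    known = [word_vectors[w] for w in phrase.lower().split() if w in word_vectors]
    if not known:
        raise ValueError(f"no known words in phrase: {phrase!r}")
    return centroid(known)


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```

With aligned multilingual vectors, two phrases in different languages that mean the same thing should end up close under `cosine_distance`, which is what lets them fall into the same cluster.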
Linking Personally Identifiable Information from the Dark Web to the Surface Web: A Deep Entity Resolution Approach
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00072
Fangyu Lin, Yizhi Liu, Mohammadreza Ebrahimi, Zara Ahmad-Post, J. Hu, Jingyu Xin, S. Samtani, Weifeng Li, Hsinchun Chen
The information privacy of Internet users has become a major societal concern. The rapid growth of online services increases the risk of unauthorized access to Personally Identifiable Information (PII) of at-risk populations, who are unaware of their PII exposure. To proactively identify online at-risk populations and increase their privacy awareness, it is crucial to conduct a holistic privacy risk assessment across the internet. Current privacy risk assessment studies are limited to a single platform within either the surface web or the dark web. A comprehensive privacy risk assessment requires matching exposed PII on heterogeneous online platforms across the surface web and the dark web. However, due to the incompleteness and inaccuracy of PII records on each platform, linking the exposed PII to users is a non-trivial task. While Entity Resolution (ER) techniques can be used to facilitate this task, they often require ad-hoc, manual rule development and feature engineering. Recently, Deep Learning (DL)-based ER has outperformed manual entity matching rules by automatically extracting prominent features from incomplete or inaccurate records. In this study, we enhance the existing privacy risk assessment with a DL-based ER method, namely Multi-Context Attention (MCA), to comprehensively evaluate individuals' PII exposure across different online platforms on the dark web and surface web. Evaluation against benchmark ER models indicates the efficacy of MCA. Using MCA on a random sample of data breach victims on the dark web, we are able to identify 4.3% of the victims on surface web platforms and calculate their privacy risk scores.
Citations: 4
Partially Shared Semi-supervised Deep Matrix Factorization with Multi-view Data
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00081
Haonan Huang, Naiyao Liang, Wei Yan, Zuyuan Yang, Weijun Sun
Because much real-world data can be described from multiple views, multi-view learning has attracted considerable attention. Various methods, typically based on matrix factorization models, have been proposed and successfully applied to multi-view learning. Recently, matrix factorization has been extended to deep structures to exploit the hierarchical information of multi-view data, but view-specific features and label information are seldom considered. To address these concerns, we present a partially shared semi-supervised deep matrix factorization model (PSDMF). By integrating a partially shared deep decomposition structure, graph regularization, and a semi-supervised regression model, PSDMF learns a compact and discriminative representation by eliminating the effects of uncorrelated information. In addition, we develop an efficient iterative updating algorithm for PSDMF. Extensive experiments on five benchmark datasets demonstrate that PSDMF achieves better performance than state-of-the-art multi-view learning approaches. The MATLAB source code is available at https://github.com/libertyhhn/PartiallySharedDMF.
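The idea of a partially shared representation can be illustrated with a small sketch. This is not the authors' PSDMF algorithm (their MATLAB code is at the URL above): it is a single-layer, unconstrained factorization fitted by alternating least squares, with no deep structure, graph regularization, or label information. The function name `partially_shared_mf`, its parameters, and the synthetic data are all hypothetical and serve only to show how each view can combine a common latent block with a view-specific one.

```python
import numpy as np

def partially_shared_mf(views, k_shared, k_specific, n_iter=50, seed=0):
    """Fit X_v ~ W_v @ [H_c; H_v] for every view by alternating least squares.

    views: list of arrays, each of shape (d_v, n), sharing the same n samples.
    H_c is common to all views; each H_v belongs to one view only.
    """
    rng = np.random.default_rng(seed)
    n = views[0].shape[1]
    H_c = rng.standard_normal((k_shared, n))
    H_s = [rng.standard_normal((k_specific, n)) for _ in views]
    W = [None] * len(views)
    history = []
    for _ in range(n_iter):
        # 1) Update each basis W_v with all representations held fixed.
        for v, X in enumerate(views):
            H = np.vstack([H_c, H_s[v]])
            # min_W ||X - W H||_F is solved through the transposed system.
            W[v] = np.linalg.lstsq(H.T, X.T, rcond=None)[0].T
        # 2) Update each view-specific block on the residual left by H_c.
        for v, X in enumerate(views):
            W_c, W_s = W[v][:, :k_shared], W[v][:, k_shared:]
            H_s[v] = np.linalg.lstsq(W_s, X - W_c @ H_c, rcond=None)[0]
        # 3) Update the shared block using the stacked residuals of all views.
        A = np.vstack([W[v][:, :k_shared] for v in range(len(views))])
        B = np.vstack([X - W[v][:, k_shared:] @ H_s[v]
                       for v, X in enumerate(views)])
        H_c = np.linalg.lstsq(A, B, rcond=None)[0]
        # Total squared reconstruction error over all views.
        loss = sum(np.linalg.norm(X - W[v] @ np.vstack([H_c, H_s[v]])) ** 2
                   for v, X in enumerate(views))
        history.append(loss)
    return W, H_c, H_s, history

# Synthetic two-view data: 2 shared factors plus 1 view-specific factor each.
rng = np.random.default_rng(1)
n = 40
shared = rng.standard_normal((2, n))
views = [rng.standard_normal((d, 3)) @ np.vstack([shared, rng.standard_normal((1, n))])
         for d in (6, 8)]
W, H_c, H_s, history = partially_shared_mf(views, k_shared=2, k_specific=1)
print(history[0], history[-1])
```

Because every step minimizes the objective exactly over one block of variables, the recorded reconstruction error should not increase from one iteration to the next. A model in the spirit of the abstract would additionally stack several factorization layers and add graph and label terms to this objective.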
Citations: 4