
2020 International Conference on Data Mining Workshops (ICDMW): Latest Papers

Scenario-aware and Mutual-based approach for Multi-scenario Recommendation in E-Commerce
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00027
Yuting Chen, Yanshi Wang, Yabo Ni, Anxiang Zeng, Lanfen Lin
Recommender systems (RSs) are essential for e-commerce platforms to help meet the enormous needs of users. How to capture user interests and make accurate recommendations in heterogeneous e-commerce scenarios remains an open research topic. However, most existing studies overlook the intrinsic association among scenarios: the log data collected from platforms can be naturally divided into different scenarios (e.g., country, city, culture). We observe that the scenarios are heterogeneous because of the large differences among them. It is therefore difficult for a unified model to capture the complex correlations (e.g., differences and similarities) between multiple scenarios, which seriously reduces the accuracy of the recommendation results. In this paper, we target the problem of multi-scenario recommendation in e-commerce and propose a novel recommendation model named Scenario-aware Mutual Learning (SAML) that leverages the differences and similarities between multiple scenarios. We first introduce a scenario-aware feature representation, which transforms the embedding and attention modules to map the features into both global and scenario-specific subspaces in parallel. We then introduce an auxiliary network to model the shared knowledge across all scenarios and use a multi-branch network to model the differences among specific scenarios. Finally, we employ a novel mutual unit to adaptively learn the similarity between scenarios and incorporate it into the multi-branch network. We conduct extensive experiments on both public and industrial datasets; the empirical results show that SAML consistently and significantly outperforms state-of-the-art methods.
Citations: 13
An unsupervised methodology for online drift detection in multivariate industrial datasets
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00061
Sarah Klein, Mathias Verbeke
Slight deviations in the evolution of measured parameters of industrial machinery or processes can signal performance degradations and upcoming failures. Therefore, the timely and accurate detection of these drifts is important, yet it is complicated by the fact that industrial datasets are often multivariate, inherently dynamic, and noisy. In this paper, a robust drift detection approach is proposed that extends a semi-parametric log-likelihood detector with adaptive windowing, allowing it to adapt dynamically to newly incoming data over time. It is shown that the approach is more accurate and can strongly reduce the computation time compared to non-adaptive approaches, while achieving a similar detection delay. When evaluated on an industrial dataset, the methodology can compete with offline drift detection methods.
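To make the idea of a log-likelihood detector with a growing reference window concrete, here is a minimal univariate sketch. It is not the authors' detector: it assumes a plain Gaussian model instead of the semi-parametric one, and the names `avg_loglik`, `detect_drift`, `min_window`, and `threshold` are hypothetical choices made only for illustration.

```python
import math
from collections import deque
from statistics import mean, pvariance


def avg_loglik(xs, mu, var):
    """Average Gaussian log-likelihood of xs under N(mu, var)."""
    var = max(var, 1e-9)  # guard against a zero-variance reference
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var) for x in xs
    ) / len(xs)


def detect_drift(stream, min_window=20, threshold=2.0):
    """Return the indices at which a drift alarm is raised (assumed Gaussian data)."""
    reference = []    # grows while the data stay stable (the adaptive part)
    recent = deque()  # sliding window holding the newest min_window points
    alarms = []
    for i, x in enumerate(stream):
        if len(reference) < min_window:
            reference.append(x)  # still collecting the initial reference
            continue
        recent.append(x)
        if len(recent) < min_window:
            continue
        mu, var = mean(reference), pvariance(reference)
        # How much worse the reference model explains the newest window
        drop = avg_loglik(reference, mu, var) - avg_loglik(recent, mu, var)
        if drop > threshold:
            alarms.append(i)
            reference, recent = [], deque()  # relearn the reference after drift
        else:
            reference.append(recent.popleft())  # absorb the oldest stable point
    return alarms
```

In this sketch the reference window keeps growing while the stream is stable and is rebuilt from scratch after an alarm. A single extreme value can trigger an alarm at this threshold, so a practical detector would need a more robust test statistic.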
Citations: 0
Accelerated SGD for Tensor Decomposition of Sparse Count Data
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00047
Huan He, Yuanzhe Xi, Joyce C. Ho
The rapid growth in the collection of high-dimensional data has led to the emergence of tensor decomposition, a powerful analysis method for exploring multidimensional data. Since tensor decomposition can extract hidden structures and capture underlying relationships between variables, it has been used successfully across a broad range of applications. However, tensor decomposition is a computationally expensive task, and existing methods for decomposing large sparse tensors of count data are not efficient enough when run with limited computing resources. Therefore, we propose AS-CP, a novel algorithm that accelerates the convergence of the stochastic gradient descent based CANDECOMP/PARAFAC (CP) decomposition model through an extrapolation method. The proposed framework can be easily parallelized in an asynchronous way. Our empirical results on three real-world datasets demonstrate that AS-CP decreases the total computation time and scales readily to large datasets without requiring a high-performance computing platform or environment. The advantage of AS-CP over several state-of-the-art methods is also shown through a machine learning task, as the factors computed by AS-CP can help identify better clinical characteristics from EHR data.
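The building blocks named in the abstract (a CP model, an SGD update, and an extrapolation step) can be sketched as follows. This is an illustrative sketch, not AS-CP itself: the Poisson loss `m - x*log(m)` is an assumption that is common for count data, the extrapolation formula is the generic `current + beta * (current - previous)`, and all function names are hypothetical.

```python
import math


def cp_entry(factors, index):
    """CP model value at one tensor index: sum over components of per-mode products."""
    rank = len(factors[0][0])
    return sum(
        math.prod(f[i][r] for f, i in zip(factors, index)) for r in range(rank)
    )


def sgd_step(factors, index, count, lr, eps=1e-10):
    """One SGD update on a single observed entry under an assumed Poisson loss."""
    m = cp_entry(factors, index)
    scale = 1.0 - count / (m + eps)  # derivative of m - count*log(m) w.r.t. m
    rank = len(factors[0][0])
    new_rows = []
    for n in range(len(factors)):
        row = []
        for r in range(rank):
            # Product of the other modes' entries for component r
            others = math.prod(
                f[i][r] for k, (f, i) in enumerate(zip(factors, index)) if k != n
            )
            # Gradient step, then clip to keep the factor nonnegative
            row.append(max(factors[n][index[n]][r] - lr * scale * others, eps))
        new_rows.append(row)
    # Write all rows back only after every gradient has been computed
    for f, i, row in zip(factors, index, new_rows):
        f[i] = row


def extrapolate(current, previous, beta):
    """Extrapolated point: current + beta * (current - previous), elementwise."""
    return [
        [c + beta * (c - p) for c, p in zip(c_row, p_row)]
        for c_row, p_row in zip(current, previous)
    ]
```

Here `factors` is a list with one factor matrix per tensor mode, each stored as a list of rows of length `rank`. Because each update touches only the rows addressed by one sampled entry, updates of this kind are natural candidates for asynchronous execution.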
Citations: 0
LbR: A New Regression Architecture for Automated Feature Engineering
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00066
Meng Wang, Zhijun Ding, Meiqin Pan
In recent years, machine learning has developed rapidly and has been widely applied in many fields, such as finance and medical treatment. Many studies have shown that feature engineering is the most important part of machine learning and the most creative part of data science. However, traditional feature engineering often requires the participation of experienced domain experts and is very time-consuming. Automatic feature engineering has therefore emerged, aiming to improve model performance by automatically generating highly informative features without expert domain knowledge. However, existing methods generate new features by pre-defining an identical set of operators for all datasets, ignoring the diversity of datasets, so there is room for improvement in performance. In this paper, we propose a method named LbR (Label based Regression), which fully mines correlations between feature pairs and then selects feature pairs with high discrimination to generate informative features. Our experiments show that LbR achieves better performance and efficiency than other methods across different datasets and machine learning models.
Citations: 2
Learning Latent Correlation of Heterogeneous Sensors Using Attention based Temporal Convolutional Network
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00076
Xin Wang, Yunji Liang, Zhiwen Yu, Bin Guo
Internet of Things devices carry various sensors. These sensors sense the environment around the device in many ways, and more sensors will be deployed as devices develop. However, when multiple sensors perform sensing together, the sensing cost increases. To prevent the growing number of sensors on mobile devices from driving up sensing costs, we study how to reduce the number of sensors while still providing the corresponding sensing functions. Learning the latent correlation between sensor data is our first task in sensor replacement. Therefore, we propose an attention-based temporal convolutional network (ATT-TCN) to learn this latent correlation. We verify the model experimentally on a collected sensor dataset, and the results show that it learns the latent correlation between heterogeneous sensors well. The proposed ATT-TCN performs better on the dataset than the basic TCN model.
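The two ingredients in the model's name can be illustrated in a few lines. This sketch shows a causal dilated 1-D convolution (the core operation of a temporal convolutional network) and softmax attention pooling over time steps. It is not the ATT-TCN architecture, whose layers and attention form the abstract does not specify, and the function names are hypothetical.

```python
import math


def causal_conv(x, kernel, dilation=1):
    """y[t] = sum_k kernel[k] * x[t - k*dilation], with zeros before t = 0."""
    out = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(kernel):
            j = t - k * dilation
            if j >= 0:  # causal: never look at future samples
                total += w * x[j]
        out.append(total)
    return out


def attention_pool(values, scores):
    """Softmax the scores and return (weights, weighted sum of values)."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    return weights, sum(w * v for w, v in zip(weights, values))
```

Stacking such convolutions with increasing dilation widens the receptive field without looking ahead in time, while the attention weights indicate which time steps contribute most to the pooled output.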
Citations: 0
Tasks performed in the legal domain through Deep Learning: A bibliometric review (1987–2020)
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00113
A. Montelongo, J. Becker
Deep Learning (DL) has become the state-of-the-art method for Natural Language Processing (NLP). During the last 5 years, DL became the primary Artificial Intelligence (AI) method in the legal domain. In this work, we provide a systematic bibliometric review of the publications that have used DL as the primary methodology. In particular, we analyze the objectives pursued (tasks performed), the corpora used to train the models, and promising areas of research. The sample includes a total of 137 works published between 1987 and 2020. The analysis runs from the first DL models (formerly Neural Networks) in the legal domain to the latest articles of the ongoing year. Our results show a 300% increase in the total number of publications during the last 5 years, mainly on information extraction and classification tasks. Moreover, classification is the category with the most publications, accounting for 39% of the total sample. Finally, we identify summarization and text generation as promising areas of research. These findings show that DL in the legal domain is currently in a growth stage and hence will be a promising research topic in the coming years.
Citations: 0
A Literature Review on Mining Cyberthreat Intelligence from Unstructured Texts
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00075
Md. Rayhanur Rahman, Rezvan Mahdavi-Hezaveh, L. Williams
Cyberthreat defense mechanisms have become more proactive, leading to the increasing incorporation of cyberthreat intelligence (CTI). Cybersecurity researchers and vendors are powering CTI with large volumes of unstructured textual data containing information on threat events, threat techniques, and tactics. Hence, extracting cyberthreat-relevant information through text mining is an effective way to obtain actionable CTI to thwart cyberattacks. The goal of this research is to help cybersecurity researchers understand the sources, purposes, and approaches for mining cyberthreat intelligence from unstructured text through a literature review of peer-reviewed studies on this topic. We perform a literature review to identify and analyze existing research on mining CTI. Search queries in the bibliographic databases return 28,484 articles. From those, 38 studies are selected through filtering criteria that remove duplicates, non-English articles, non-peer-reviewed articles, and articles not about mining CTI. We find that the most prominent sources of unstructured threat data are threat reports, Twitter feeds, and posts from hackers and security experts. We also observe that security researchers mined CTI from unstructured sources for Indicator of Compromise (IoC) extraction, threat-related topic identification, and event detection. Finally, the selected studies mostly rely on natural language processing (NLP) based approaches (topic classification, keyword identification, and semantic relationship extraction among the keywords) to mine CTI from unstructured threat sources.
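As a point of reference for what extracting an Indicator of Compromise from unstructured text can look like in its simplest form, the sketch below uses regular expressions. This is a deliberately naive baseline and not one of the NLP approaches surveyed in the review. The pattern set, the `extract_iocs` name, and the sample report are hypothetical.

```python
import re

# Hypothetical baseline patterns; real IoC extractors handle many more formats.
IOC_PATTERNS = {
    # Four dot-separated groups of 1 to 3 digits (octet ranges are not validated)
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    # Exactly 32 hexadecimal characters, the length of an MD5 digest
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
}


def extract_iocs(text):
    """Return a dict mapping each IoC type to the matches found in text."""
    return {name: pattern.findall(text) for name, pattern in IOC_PATTERNS.items()}


report = (
    "The dropper contacted 192.168.10.5 and wrote a file with "
    "MD5 d41d8cd98f00b204e9800998ecf8427e."
)
found = extract_iocs(report)
```

Patterns like these miss defanged indicators and carry no context about what an indicator means, which is one reason classification and relationship extraction are attractive for this task.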
Citations: 8
A Hierarchical Knowledge and Interest Propagation Network for Recommender Systems
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00026
Qinghong Chen, Huobin Tan, Guangyan Lin, Ze Wang
To address the data sparsity and cold start issues of collaborative filtering, side information, such as social networks and knowledge graphs, is introduced to recommender systems. A knowledge graph, as a kind of auxiliary and structural data, is full of semantic and logical connections among entities in the world. In this paper, we propose a Hierarchical Knowledge and Interest Propagation Network (HKIPN) for recommendation, in which a new heterogeneous propagation method is presented. Specifically, HKIPN propagates knowledge and user interest simultaneously in a unified graph that combines the user-item bipartite interaction graph with the knowledge graph. During propagation, a hierarchical method is devised to aggregate a node's high-order neighbors explicitly and concurrently. In addition, an attention mechanism is employed to discriminate the importance of neighbors. Furthermore, because information decays in the process of propagation, a decay factor is taken into account as the weight of each hierarchical representation when composing the final user and item representations. We apply the proposed model to three benchmark datasets on movie, book, and music recommendation and compare it with state-of-the-art baselines. The experimental results and further studies demonstrate that our approach outperforms compelling recommender baselines.
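The role of the decay factor can be sketched on a toy graph. This is a simplified reading of the abstract, not HKIPN itself: it assumes mean aggregation over neighbors in place of the attention mechanism and weights the hop-`k` representation by `decay ** k`, and the `propagate` function and the graph below are hypothetical.

```python
def propagate(adj, emb, hops, decay):
    """Combine each node's 0..hops-order representations, weighted by decay**k."""
    layers = [emb]  # layers[k][node] is the representation after k hops
    for _ in range(hops):
        prev = layers[-1]
        nxt = {}
        for node, nbrs in adj.items():
            dim = len(prev[node])
            if nbrs:
                # Mean of the neighbors' previous-layer vectors (attention omitted)
                nxt[node] = [
                    sum(prev[n][d] for n in nbrs) / len(nbrs) for d in range(dim)
                ]
            else:
                nxt[node] = list(prev[node])  # isolated node keeps its vector
        layers.append(nxt)
    final = {}
    for node in adj:
        dim = len(emb[node])
        final[node] = [
            sum((decay ** k) * layers[k][node][d] for k in range(hops + 1))
            for d in range(dim)
        ]
    return final


# Toy unified graph: user "u" interacted with item "i", which links to entity "e".
adj = {"u": ["i"], "i": ["u", "e"], "e": ["i"]}
emb = {"u": [1.0], "i": [0.0], "e": [2.0]}
final = propagate(adj, emb, hops=2, decay=0.5)
```

With `decay` below 1, information from distant neighbors still reaches a node but contributes less than its own embedding and its direct neighbors.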
Citations: 4
Anomaly Detection of Periodic Multivariate Time Series under High Acquisition Frequency Scene in IoT
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00078
Shuo Zhang, Xiaofei Chen, Jiayuan Chen, Qiao Jiang, Hejiao Huang
Anomaly detection in multivariate time series is an active research topic in data mining, especially with the rise of Industry 4.0. However, few existing approaches address high-acquisition-frequency scenes, and only a minority of them take the periodicity of the time series into consideration. In this paper, we propose a novel network, Dual-Window RNN-CNN, to detect anomalies in periodic time series under high-acquisition-frequency scenes in IoT. We first apply a dual window to segment the time series according to the periodicity of the data and to solve the time alignment problem. Then we use a Multi-head GRU to compress the data volume and extract temporal features sensor by sensor, which not only solves the problems caused by high acquisition frequency but also gives our network more flexible transfer ability. To improve the robustness of our network in different periodic scenes of IoT, three different GRU modes are put forward. Finally, we use a CNN-based autoencoder to locate anomalies according to both temporal and spatial dependencies. It should also be noted that the Multi-head GRU broadens the receptive field of the CNN-based autoencoder. Two sets of experiments were carried out to verify the validity of Dual-Window RNN-CNN. The first is conducted on the UCR/UEA benchmark to discuss the performance of Dual-Window RNN-CNN under different structures and hyperparameters, since datasets in the UCR/UEA benchmark contain enough timestamps to monitor the high acquisition frequency and periodicity in IoT. The second is conducted on the Yahoo Webscope benchmark and NAB to compare our network with other classic time series anomaly detection approaches. Experimental results confirm that Dual-Window RNN-CNN outperforms other approaches in anomaly detection of periodic multivariate time series, demonstrating the advantages of our network in high-acquisition-frequency scenes.
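As a rough illustration of segmenting a series according to its period, the sketch below splits a univariate series into period-length windows and flags samples that deviate from the per-phase median. It is not the authors' Dual-Window RNN-CNN: the function names, the median profile, and the fixed threshold are assumptions made for illustration, and the Multi-head GRU and autoencoder stages are omitted entirely.

```python
from statistics import median
from typing import List, Sequence


def segment_by_period(series: Sequence[float], period: int) -> List[Sequence[float]]:
    """Split a series into consecutive windows of one period each.

    Trailing samples that do not fill a whole period are dropped.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    n_full = len(series) // period
    return [series[i * period:(i + 1) * period] for i in range(n_full)]


def flag_anomalies(series: Sequence[float], period: int, threshold: float) -> List[int]:
    """Return indices whose value differs from the per-phase median by more than threshold.

    Assumption: the period is known in advance and the series is univariate.
    """
    windows = segment_by_period(series, period)
    if not windows:
        return []
    # Typical value at each phase position, taken across all periods.
    profile = [median(w[p] for w in windows) for p in range(period)]
    flags = []
    for k, window in enumerate(windows):
        for p, value in enumerate(window):
            if abs(value - profile[p]) > threshold:
                flags.append(k * period + p)
    return flags


# Hypothetical data: five repetitions of a four-sample pattern with one spike.
readings = [0, 1, 2, 1] * 5
readings[9] = 10
suspicious = flag_anomalies(readings, period=4, threshold=3)
```

Aligning windows to the period means each sample is compared only with samples at the same phase, which is the intuition behind period-aware segmentation; a learned model would replace the median profile.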
Citations: 7
DQN-based Join Order Optimization by Learning Experiences of Running Queries on Spark SQL
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00107
Kyeong-Min Lee, InA Kim, Kyu-Chul Lee
In a smart grid, various types of queries, such as ad-hoc queries and analytic queries, are issued against the data. Query evaluation on a single-node database engine is limited because these queries target large-scale data. In this paper, to improve the performance of retrieving large-scale data in the smart grid environment, we propose a DQN-based join order optimization model on Spark SQL. The model learns from the actual processing time of queries evaluated on Spark SQL, not from estimated costs. By learning the optimal join orders from previous experiences, we optimize join orders with performance similar to Spark SQL without collecting and computing the statistics of an input data set.
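The idea of learning join orders from observed run times rather than estimated costs can be sketched as a small reinforcement learning loop. This sketch swaps in tabular Q-learning for the deep Q-network named in the abstract, and it does not call Spark SQL: `run_time` is a hypothetical callback standing in for executing a query and timing it, and the table names and timings are invented for illustration.

```python
import random
from typing import Callable, Dict, Sequence, Tuple

Order = Tuple[str, ...]


def learn_join_order(
    tables: Sequence[str],
    run_time: Callable[[Order], float],
    episodes: int = 1000,
    alpha: float = 0.5,
    epsilon: float = 0.2,
    seed: int = 0,
) -> Order:
    """Learn a left-deep join order with tabular Q-learning.

    State: tables joined so far. Action: next table to join.
    Reward: negative observed run time, given once the order is complete.
    """
    rng = random.Random(seed)
    q: Dict[Tuple[Order, str], float] = {}

    def best(state: Order) -> str:
        # Greedy action: the remaining table with the highest estimated return.
        remaining = [t for t in tables if t not in state]
        return max(remaining, key=lambda a: q.get((state, a), 0.0))

    for _ in range(episodes):
        state: Order = ()
        while len(state) < len(tables):
            remaining = [t for t in tables if t not in state]
            # Epsilon-greedy exploration over the remaining tables.
            action = rng.choice(remaining) if rng.random() < epsilon else best(state)
            next_state = state + (action,)
            if len(next_state) == len(tables):
                target = -run_time(next_state)  # observed time, not an estimate
            else:
                target = q.get((next_state, best(next_state)), 0.0)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (target - old)
            state = next_state

    # Read out the greedy order from the learned values.
    state = ()
    while len(state) < len(tables):
        state += (best(state),)
    return state


# Hypothetical measured run times (seconds) for each join order of three tables.
measured = {
    ("A", "B", "C"): 5.0, ("A", "C", "B"): 9.0, ("B", "A", "C"): 3.0,
    ("B", "C", "A"): 7.0, ("C", "A", "B"): 8.0, ("C", "B", "A"): 6.0,
}
chosen = learn_join_order(["A", "B", "C"], measured.__getitem__)
```

A lookup table only works for a handful of tables; a function approximator such as a DQN is what lets the approach generalize across queries with many possible orders.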
Citations: 2