Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00027
Yuting Chen, Yanshi Wang, Yabo Ni, Anxiang Zeng, Lanfen Lin
Recommender systems (RSs) are essential for e-commerce platforms to help meet the enormous needs of users. How to capture user interests and make accurate recommendations in heterogeneous e-commerce scenarios remains an open research topic. However, most existing studies overlook the intrinsic association among scenarios: the log data collected from platforms can be naturally divided into different scenarios (e.g., country, city, culture). We observe that the scenarios are heterogeneous because of the large differences among them. It is therefore difficult for a unified model to effectively capture the complex correlations (e.g., differences and similarities) between multiple scenarios, which seriously reduces recommendation accuracy. In this paper, we target the problem of multi-scenario recommendation in e-commerce and propose a novel recommendation model named Scenario-aware Mutual Learning (SAML) that leverages the differences and similarities between multiple scenarios. We first introduce a scenario-aware feature representation, which transforms the embedding and attention modules to map the features into both global and scenario-specific subspaces in parallel. Then we introduce an auxiliary network to model the shared knowledge across all scenarios, and use a multi-branch network to model the differences among specific scenarios. Finally, we employ a novel mutual unit to adaptively learn the similarity between scenarios and incorporate it into the multi-branch network. We conduct extensive experiments on both public and industrial datasets; the empirical results show that SAML consistently and significantly outperforms state-of-the-art methods.
Title: Scenario-aware and Mutual-based approach for Multi-scenario Recommendation in E-Commerce. Published in: 2020 International Conference on Data Mining Workshops (ICDMW).
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00061
Sarah Klein, Mathias Verbeke
Slight deviations in the evolution of measured parameters of industrial machinery or processes can signal performance degradations and upcoming failures. Therefore, the timely and accurate detection of these drifts is important, yet complicated by the fact that industrial datasets are often multivariate, inherently dynamic, and noisy. In this paper, a robust drift detection approach is proposed that extends a semi-parametric log-likelihood detector with adaptive windowing, allowing it to adapt dynamically to newly incoming data over time. It is shown that the approach is more accurate and can strongly reduce the computation time when compared to non-adaptive approaches, while achieving a similar detection delay. When evaluated on an industrial dataset, the methodology can compete with offline drift detection methods.
Title: An unsupervised methodology for online drift detection in multivariate industrial datasets. Published in: 2020 International Conference on Data Mining Workshops (ICDMW).
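The idea of a log-likelihood drift detector with a window that adapts after each detection, as described in the abstract above, can be illustrated with a small sketch. This is not the authors' method: it assumes a univariate Gaussian reference model instead of a semi-parametric multivariate one, and the function names, window sizes, and threshold are all hypothetical choices made for illustration.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log-density of x under a univariate Gaussian (assumed reference model)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def fit_gaussian(window):
    """Estimate mean and variance; the variance is floored to avoid division by zero."""
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    return mean, max(var, 1e-9)


def detect_drifts(stream, ref_size=50, test_size=20, threshold=2.0):
    """Return the indices at which a drift is flagged.

    A drift is flagged when the mean log-likelihood of the most recent
    test window, scored under the reference model, falls more than
    `threshold` below the reference window's own mean log-likelihood.
    """
    drifts = []
    start = 0                      # start of the current reference window
    t = ref_size + test_size       # exclusive end of the current test window
    while t <= len(stream):
        ref = stream[start:start + ref_size]
        test = stream[t - test_size:t]
        mean, var = fit_gaussian(ref)
        ref_ll = sum(gaussian_loglik(x, mean, var) for x in ref) / len(ref)
        test_ll = sum(gaussian_loglik(x, mean, var) for x in test) / len(test)
        if ref_ll - test_ll > threshold:
            drifts.append(t - 1)
            # Adapt: rebuild the reference from data arriving after the detection.
            start = t
            t = start + ref_size + test_size
        else:
            t += 1
    return drifts


if __name__ == "__main__":
    signal = [0.0, 1.0] * 50 + [10.0, 11.0] * 50
    print(detect_drifts(signal))
```

Resetting the reference window after a detection is only the simplest possible form of adaptation; it keeps the detector from raising the same alarm repeatedly once the process has settled at a new level.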
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00047
Huan He, Yuanzhe Xi, Joyce C. Ho
The rapid growth in the collection of high-dimensional data has led to the emergence of tensor decomposition, a powerful analysis method for the exploration of multidimensional data. Since tensor decomposition can extract hidden structures and capture underlying relationships between variables, it has been used successfully across a broad range of applications. However, tensor decomposition is a computationally expensive task, and existing methods developed to decompose large sparse tensors of count data are not efficient enough when run with limited computing resources. Therefore, we propose AS-CP, a novel algorithm that uses an extrapolation method to accelerate the convergence of the CANDECOMP/PARAFAC (CP) decomposition model based on stochastic gradient descent. The proposed framework can be easily parallelized in an asynchronous way. Our empirical results on three real-world datasets demonstrate that AS-CP decreases the total computation time and scales readily to large datasets without requiring a high-performance computing platform or environment. The advantage of AS-CP over several state-of-the-art methods is also shown through a machine learning task, as the factors computed by AS-CP can help identify better clinical characteristics from EHR data.
Title: Accelerated SGD for Tensor Decomposition of Sparse Count Data. Published in: 2020 International Conference on Data Mining Workshops (ICDMW).
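For readers unfamiliar with CP decomposition of count data, the following sketch shows plain stochastic gradient descent on a Poisson loss for a three-way tensor. It deliberately omits the extrapolation step and the asynchronous parallelism that AS-CP adds, and it only fits observed nonzero entries, which is a simplification. All names, the learning rate, and the initialization are assumptions made for illustration.

```python
import math


def cp_value(A, B, C, i, j, k):
    """Rank-R CP model value: sum over r of A[i][r] * B[j][r] * C[k][r]."""
    return sum(a * b * c for a, b, c in zip(A[i], B[j], C[k]))


def poisson_loss(A, B, C, entries):
    """Poisson negative log-likelihood (up to a constant) over the given entries."""
    total = 0.0
    for (i, j, k), count in entries.items():
        m = cp_value(A, B, C, i, j, k)
        total += m - count * math.log(m)
    return total


def sgd_step(A, B, C, index, count, lr=0.05, floor=1e-6):
    """One SGD update on a single entry, keeping all factors strictly positive."""
    i, j, k = index
    m = cp_value(A, B, C, i, j, k)
    scale = 1.0 - count / m                 # derivative of (m - count * log m) w.r.t. m
    a, b, c = A[i][:], B[j][:], C[k][:]     # copies, so every gradient uses old values
    for r in range(len(a)):
        A[i][r] = max(floor, a[r] - lr * scale * b[r] * c[r])
        B[j][r] = max(floor, b[r] - lr * scale * a[r] * c[r])
        C[k][r] = max(floor, c[r] - lr * scale * a[r] * b[r])


def init_factors(shape, rank, value=0.5):
    """Constant initialization, chosen only to keep the sketch deterministic."""
    return tuple([[value] * rank for _ in range(n)] for n in shape)


def fit(entries, shape, rank=2, epochs=200):
    A, B, C = init_factors(shape, rank)
    for _ in range(epochs):
        for index, count in entries.items():
            sgd_step(A, B, C, index, count)
    return A, B, C


if __name__ == "__main__":
    observed = {(0, 0, 0): 3, (1, 1, 1): 2, (0, 1, 0): 1}
    A, B, C = fit(observed, shape=(2, 2, 2))
    print(poisson_loss(A, B, C, observed))
```

The Poisson loss is the usual choice for count data because it keeps the model value positive and penalizes relative rather than absolute error.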
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00066
Meng Wang, Zhijun Ding, Meiqin Pan
In recent years, machine learning has developed rapidly and has been widely applied in many fields, such as finance and medical treatment. Many studies have shown that feature engineering is the most important part of machine learning and the most creative part of data science. However, traditional feature engineering often requires the participation of experienced domain experts and is very time-consuming. Automatic feature engineering has therefore emerged, aiming to improve model performance by automatically generating highly informative features without expert domain knowledge. However, these methods generate new features by applying the same pre-defined set of operators to every dataset, ignoring the diversity of datasets, so there is room for improvement in performance. In this paper, we propose a method named LbR (Label based Regression), which fully mines correlations between feature pairs and then selects feature pairs with high discrimination to generate informative features. We conducted many experiments to show that LbR achieves better performance and efficiency than other methods across different datasets and machine learning models.
Title: LbR: A New Regression Architecture for Automated Feature Engineering. Published in: 2020 International Conference on Data Mining Workshops (ICDMW).
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00076
Xin Wang, Yunji Liang, Zhiwen Yu, Bin Guo
Internet of Things devices carry a variety of sensors. These sensors sense the environment around the device in many ways, and more sensors will be deployed as devices develop. However, when multiple sensors perform sensing work together, the sensing cost increases. To prevent the rise in sensing cost caused by the growing number of sensors on mobile devices, we study how to reduce the number of sensors while still completing the corresponding sensing functions. Learning the latent correlation between sensor data is the first task in sensor replacement. Therefore, we propose the attention-based temporal convolutional network (ATT-TCN) to learn this latent correlation. Experiments on a collected sensor dataset show that the proposed model learns the latent correlation between heterogeneous sensors well and that ATT-TCN outperforms the basic TCN model on this dataset.
Title: Learning Latent Correlation of Heterogeneous Sensors Using Attention based Temporal Convolutional Network. Published in: 2020 International Conference on Data Mining Workshops (ICDMW).
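The two building blocks named in the abstract above, a temporal convolutional network and attention, can be sketched in a few lines. The sketch below shows a single causal dilated convolution (the core operation of a TCN) and a softmax attention pooling over time steps. How ATT-TCN actually combines them is not stated in the abstract, so the function names and the pooling step are assumptions for illustration only.

```python
import math


def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum over k of weights[k] * x[t - k * dilation], with zeros before t = 0.

    The output at time t never depends on inputs later than t, which is
    what makes the convolution causal.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            src = t - k * dilation
            if src >= 0:
                acc += w * x[src]
        out.append(acc)
    return out


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scores):
    """Weighted average of per-time-step features, with softmax weights from scores."""
    weights = softmax(scores)
    return sum(w * f for w, f in zip(weights, features))


if __name__ == "__main__":
    series = [1, 2, 3, 4, 5]
    hidden = causal_dilated_conv(series, weights=[1, 1], dilation=2)
    print(hidden)
    print(attention_pool(hidden, scores=hidden))
```

Stacking such convolutions with dilations 1, 2, 4, and so on is what lets a TCN cover a long history with few layers.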
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00113
A. Montelongo, J. Becker
Deep Learning (DL) has become the state-of-the-art method for Natural Language Processing (NLP). During the last 5 years, DL has become the primary Artificial Intelligence (AI) method in the legal domain. In this work we provide a systematic bibliometric review of the publications that have used DL as the primary methodology. In particular, we analyze the objectives pursued (tasks performed), the corpora used to train the models, and promising areas of research. The sample includes a total of 137 works published between 1987 and 2020. The analysis runs from the first DL models (formerly Neural Networks) in the legal domain to the latest articles of the ongoing year. Our results show a 300% increase in the total number of publications during the last 5 years, mainly on information extraction and classification tasks. Moreover, classification is the category with the most publications, accounting for 39% of the total sample. Finally, we have identified summarization and text generation as promising areas of research. These findings show that DL in the legal domain is currently in a growth stage, and hence it will be a promising topic of research in the coming years.
Title: Tasks performed in the legal domain through Deep Learning: A bibliometric review (1987–2020). Published in: 2020 International Conference on Data Mining Workshops (ICDMW).
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00026
Qinghong Chen, Huobin Tan, Guangyan Lin, Ze Wang
To address the data sparsity and cold start issues of collaborative filtering, side information, such as social networks and knowledge graphs, is introduced into recommender systems. A knowledge graph, as a kind of auxiliary structured data, is full of semantic and logical connections among real-world entities. In this paper, we propose a Hierarchical Knowledge and Interest Propagation Network (HKIPN) for recommendation, in which a new heterogeneous propagation method is presented. Specifically, HKIPN propagates knowledge and user interest simultaneously in a unified graph that combines the user-item bipartite interaction graph and the knowledge graph. During propagation, a hierarchical method is devised to aggregate a node's high-order neighbors explicitly and concurrently. In addition, an attention mechanism is employed to discriminate the importance of neighbors. Furthermore, because information decays during propagation, a decay factor is used as the weight of each hierarchical representation when composing the final user and item representations. We apply the proposed model to three benchmark datasets for movie, book, and music recommendation and compare it with state-of-the-art baselines. The experimental results and further studies demonstrate that our approach outperforms competitive recommender baselines.
Title: A Hierarchical Knowledge and Interest Propagation Network for Recommender Systems. Published in: 2020 International Conference on Data Mining Workshops (ICDMW).
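The role of the decay factor described in the abstract above can be made concrete with a small sketch: each hop of propagation yields one representation of a node, and the final representation is a weighted sum in which more distant hops count less. The abstract does not give the exact weighting, so the geometric form `decay ** hop` and the function name below are assumptions for illustration.

```python
def combine_hops(hop_representations, decay=0.5):
    """Combine per-hop vectors into one, weighting hop h by decay ** h.

    hop_representations[0] is the node's own embedding, hop_representations[1]
    aggregates its direct neighbors, and so on (assumed layout).
    """
    dim = len(hop_representations[0])
    final = [0.0] * dim
    for hop, rep in enumerate(hop_representations):
        weight = decay ** hop
        for d in range(dim):
            final[d] += weight * rep[d]
    return final


if __name__ == "__main__":
    hops = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
    print(combine_hops(hops, decay=0.5))
```

With a decay below 1, high-order neighbors still contribute but cannot drown out the node's own embedding and its direct neighborhood.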
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00058
Kin-Hon Ho, Wai-Han Chiu, Chin Li
We conduct a network analysis with centrality measures, using historical daily closing prices of the top 120 cryptocurrencies between 2013 and 2020, to study and understand the dynamic evolution and characteristics of the cryptocurrency market. Our study has three primary findings: (1) the overall cross-return correlation among the cryptocurrencies weakens from 2013 to 2016 and strengthens thereafter; (2) cryptocurrencies that are primarily used for transaction payment, notably BTC, dominate the market until mid-2016, followed between mid-2016 and mid-2017 by those developed for applications that use blockchain as the underlying technology, particularly data storage and recording, such as MAID and FCT. Since then, ETH and its strongly correlated cryptocurrencies have replaced BTC to become the benchmark cryptocurrencies. Furthermore, during the outbreak of COVID-19, QTUM and BNB have intermittently replaced ETH in the leading positions due to their active community engagement during the pandemic; (3) centrality measures are useful features for improving the prediction accuracy of short-term cryptocurrency price movement.
Title: A Short-Term Cryptocurrency Price Movement Prediction Using Centrality Measures. Published in: 2020 International Conference on Data Mining Workshops (ICDMW).
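A centrality measure on a return-correlation network, as used in the abstract above, can be computed as in the following sketch. It assumes simple daily returns, Pearson correlation, an edge whenever the correlation reaches a threshold, and degree centrality. The abstract does not say which centrality measures or thresholds the authors use, so these choices, the function names, and the placeholder asset names and prices are all hypothetical.

```python
import math


def simple_returns(prices):
    """Day-over-day returns computed from a list of closing prices."""
    return [later / earlier - 1.0 for earlier, later in zip(prices, prices[1:])]


def pearson(x, y):
    """Pearson correlation of two equal-length series with nonzero variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def degree_centrality(prices_by_asset, threshold=0.9):
    """Link two assets when their return correlation is at least `threshold`.

    Returns, for each asset, its number of links divided by the number of
    other assets, so values lie between 0 and 1.
    """
    names = sorted(prices_by_asset)
    returns = {name: simple_returns(prices_by_asset[name]) for name in names}
    degree = {name: 0 for name in names}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if pearson(returns[first], returns[second]) >= threshold:
                degree[first] += 1
                degree[second] += 1
    return {name: degree[name] / (len(names) - 1) for name in names}


if __name__ == "__main__":
    # Made-up prices for three placeholder assets, not real market data.
    prices = {
        "A": [100.0, 110.0, 99.0, 118.8],
        "B": [50.0, 55.0, 49.5, 59.4],
        "C": [10.0, 11.0, 12.1, 9.68],
    }
    print(degree_centrality(prices))
```

Recomputing the centralities over a rolling window of days would give one time series per asset, which is the kind of feature a price-movement predictor could consume.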
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00075
Md. Rayhanur Rahman, Rezvan Mahdavi-Hezaveh, L. Williams
Cyberthreat defense mechanisms have become more proactive, leading to the increasing incorporation of cyberthreat intelligence (CTI). Cybersecurity researchers and vendors feed CTI with large volumes of unstructured textual data containing information on threat events, threat techniques, and tactics. Hence, extracting cyberthreat-relevant information through text mining is an effective way to obtain actionable CTI to thwart cyberattacks. The goal of this research is to help cybersecurity researchers understand the sources, purposes, and approaches for mining cyberthreat intelligence from unstructured text through a literature review of peer-reviewed studies on this topic. We perform a literature review to identify and analyze existing research on mining CTI. Search queries in bibliographic databases return 28,484 articles. From those, 38 studies are selected through filtering criteria that remove duplicates, non-English articles, non-peer-reviewed articles, and articles not about mining CTI. We find that the most prominent sources of unstructured threat data are threat reports, Twitter feeds, and posts from hackers and security experts. We also observe that security researchers mine CTI from unstructured sources for Indicator of Compromise (IoC) extraction, threat-related topic identification, and event detection. Finally, the approaches most commonly used in the selected studies to mine CTI from unstructured threat sources are based on natural language processing (NLP): topic classification, keyword identification, and semantic relationship extraction among the keywords.
Title: A Literature Review on Mining Cyberthreat Intelligence from Unstructured Texts. Published in: 2020 International Conference on Data Mining Workshops (ICDMW).
Pub Date: 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00019
L. Subhashini, Yuefeng Li, Jinglan Zhang, Ajantha S Atukorale
Uncertainty is a challenging problem in opinion mining models. Existing models that use machine learning algorithms are unable to identify uncertainty within online customer reviews because of broad uncertain boundaries. Many researchers have developed fuzzy models to solve this problem, but the problem of large uncertain boundaries remains with fuzzy models. The common challenge is that there is a large uncertain boundary between the positive and negative classes, as user reviews (or opinions) include many uncertainties. Dealing with these uncertainties is difficult because many frequently used words may be irrelevant. This paper proposes a three-way based framework that integrates fuzzy concepts and deep learning to solve the problem of uncertainty. Many experiments were conducted using movie review and ebook review datasets. The experimental results show that the proposed three-way framework is useful for dealing with uncertainties in opinions, and we were able to show a significant F-measure on two benchmark datasets.
Title: Integration of Fuzzy and Deep Learning in Three-Way Decisions. Published in: 2020 International Conference on Data Mining Workshops (ICDMW).
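The three-way decision idea behind the abstract above splits reviews into a positive region, a negative region, and a boundary region that is set aside for further treatment. A minimal sketch of that rule follows, assuming a single positive-class score per review and two thresholds. The threshold values and the function name are hypothetical, and the abstract does not specify how the proposed framework processes the boundary region.

```python
def three_way_decision(positive_score, alpha=0.7, beta=0.3):
    """Assign a review to the positive, negative, or boundary region.

    positive_score is assumed to be a membership or probability in [0, 1].
    Scores at or above alpha are accepted as positive, scores at or below
    beta are accepted as negative, and everything in between is deferred.
    """
    if not beta < alpha:
        raise ValueError("beta must be smaller than alpha")
    if positive_score >= alpha:
        return "positive"
    if positive_score <= beta:
        return "negative"
    return "boundary"


if __name__ == "__main__":
    for score in (0.92, 0.55, 0.10):
        print(score, three_way_decision(score))
```

Widening the gap between the two thresholds sends more reviews to the boundary region, which trades coverage for confidence in the two definite regions.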