Abnormal host detection is a critical issue in enterprise intranet data centers. Traditional anomalous-host detection methods mainly focus on detecting anomalous behavior, and judging abnormality from a single behavior point has clear limitations: the entire attack process cannot be restored, and many attacks go unreported. In this paper, we therefore propose a behavior sequence clustering-based anomalous host detection method for enterprise networks. We use the Toeplitz Inverse Covariance-Based Clustering (TICC) algorithm [1] to segment and cluster time-series data, mine anomalous host behavior sequences, and identify anomalous hosts in the enterprise network. The experimental results show that the proposed method can quickly identify anomalous hosts and accurately restore the complete attack process.
{"title":"A Behavior Sequence Clustering-Based Enterprise Network Anomaly Host Recognition Method","authors":"Jing Tao, Ning Zheng, Waner Wang, Ting Han, Xuna Zhan, Qingxin Luan","doi":"10.1109/ICBK.2019.00039","DOIUrl":"https://doi.org/10.1109/ICBK.2019.00039","url":null,"abstract":"Abnormal host detection is a critical issue in an enterprise intranet data center. The traditional anomaly host detection method mainly focuses on detecting anomaly behavior, and the abnormality determination for a single behavior point often has certain limitations. For example, the entire attack process cannot be completely restored. And it will cause a lot of underreporting. Therefore, in this paper, we propose A Behavior Sequence Clustering-based Enterprise Network Anomaly Host Detection Method to solve the problem of anomaly host detection of an enterprise network. We use the Toeplitz Inverse Covariance-Based Clustering (TICC) algorithm [1] to segment and cluster time series data and mining anomaly host behavior sequences, identify the anomaly host of the enterprise network. The experimental results show that the Behavior Sequence Clustering-based Enterprise Network Anomaly Host Recognition Method can quickly identify the anomaly host and accurately restore the complete attack process.","PeriodicalId":383917,"journal":{"name":"2019 IEEE International Conference on Big Knowledge (ICBK)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131673319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
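TICC itself fits Toeplitz-regularized inverse covariances per cluster and is beyond a short sketch; the fragment below is only a hypothetical illustration of the downstream step the abstract describes: once each host's activity has been segmented into a sequence of cluster labels, hosts whose sequences contain rare label n-grams are flagged as anomalous. All names and thresholds here are assumptions for illustration.

```python
from collections import Counter

def rare_ngrams(sequences, n=2, min_support=2):
    """Count label n-grams across all host sequences and return those
    seen fewer than min_support times (candidate anomalous behaviors)."""
    counts = Counter()
    for seq in sequences.values():
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return {g for g, c in counts.items() if c < min_support}

def flag_anomalous_hosts(sequences, n=2, min_support=2):
    """Flag hosts whose cluster-label sequence contains a rare n-gram;
    the matched sub-sequences sketch the attack process in order."""
    rare = rare_ngrams(sequences, n, min_support)
    flagged = {}
    for host, seq in sequences.items():
        hits = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)
                if tuple(seq[i:i + n]) in rare]
        if hits:
            flagged[host] = hits
    return flagged

# Toy example: labels 0/1 are routine behavior clusters, 2 is rare.
seqs = {
    "host-a": [0, 1, 0, 1, 0, 1],
    "host-b": [0, 1, 0, 1, 0, 1],
    "host-c": [0, 1, 2, 2, 0, 1],   # behavior never seen elsewhere
}
print(flag_anomalous_hosts(seqs))
```

Because the hits are returned in temporal order, the flagged sub-sequences give a coarse reconstruction of the behavior chain, which is the intuition behind restoring the attack process from sequences rather than single points.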
Spatial co-location pattern mining is the process of finding groups of distinct spatial features whose instances frequently appear in close proximity to each other. The proximity of instances is often defined by the distance between them: if the distance is smaller than a user-specified threshold, the instances have a neighbor relationship. Under this definition, however, proximity depends heavily on the distance threshold, the heterogeneity of the distribution density of spatial datasets is neglected, and it is hard for users to choose a suitable threshold value. In this paper, we propose a statistical method that determines the neighbor relationships of instances in space without requiring a distance threshold from users. First, the proximity of instances is roughly materialized by a Delaunay triangulation. Then, based on statistical information about the vertices and edges of the Delaunay triangulation, we design three strategies to constrain it. Neighbor relationships are extracted automatically and accurately from the constrained Delaunay triangulation without user-specified distance thresholds. After that, we propose a k-order neighbor notion to obtain instance neighborhoods for mining co-location patterns. Finally, we develop a constrained Delaunay triangulation-based k-order neighborhood co-location pattern mining algorithm called CDT-kN-CP. Results on both synthetic datasets and real point-of-interest datasets from Beijing and Guangzhou, China indicate that our method improves both accuracy and scalability compared with previous methods.
{"title":"A Spatial Co-location Pattern Mining Algorithm Without Distance Thresholds","authors":"Vanha Tran, Lizhen Wang, Hongmei Chen","doi":"10.1109/ICBK.2019.00040","DOIUrl":"https://doi.org/10.1109/ICBK.2019.00040","url":null,"abstract":"Spatial co-location pattern mining is a process of finding a group of distinct spatial features whose instances frequently appear in close proximity to each other. The proximity of instances is often defined by the distance between them, if the distance is smaller than a distance threshold specified by users, they have a neighbor relationship. However, in this definition, the proximity of instances deeply depends on the distance threshold, the heterogeneity of the distribution density of spatial datasets is neglected, and it is hard for users to give a suitable threshold value. In this paper, we propose a statistical method that eliminates the distance threshold parameters from users to determine the neighbor relationships of instances in space. First, the proximity of instances is roughly materialized by employing Delaunay triangulation. Then, according to the statistical information of the vertices and edges in the Delaunay triangulation, we design three strategies to constrain the Delaunay triangulation. The neighbor relationships of instances are extracted automatically and accurately from the constrained Delaunay triangulation without requiring users to specify distance thresholds. After that, we propose a k-order neighbor notion to get neighborhoods of instances for mining co-location patterns. Finally, we develop a constrained Delaunay triangulation-based k-order neighborhood co-location pattern mining algorithm called CDT-kN-CP. The results of testing our algorithm on both synthetic datasets and the real point-of-interest datasets of Beijing and Guangzhou, China indicate that our method improves both accuracy and scalability compared with previous methods.","PeriodicalId":383917,"journal":{"name":"2019 IEEE International Conference on Big Knowledge (ICBK)","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115314693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
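The paper's three constraint strategies are not spelled out in the abstract; as a hedged stand-in, the sketch below applies one common statistical rule for constraining a triangulation — cutting "global long edges" whose length exceeds the mean plus k standard deviations — and reads neighbor relationships off the surviving edges. The edge list would come from a Delaunay triangulation computed elsewhere; only the pruning step is shown.

```python
import math

def prune_long_edges(points, edges, k=1.0):
    """Keep only edges no longer than mean + k * std of all edge lengths.
    `edges` is a list of (i, j) index pairs, e.g. from a Delaunay
    triangulation; this illustrates the statistical pruning step only."""
    lengths = [math.dist(points[i], points[j]) for i, j in edges]
    mean = sum(lengths) / len(lengths)
    std = math.sqrt(sum((l - mean) ** 2 for l in lengths) / len(lengths))
    cutoff = mean + k * std
    return [(i, j) for (i, j), l in zip(edges, lengths) if l <= cutoff]

def neighbors(points, edges, k=1.0):
    """Adjacency lists induced by the pruned edge set: the threshold-free
    neighbor relationships."""
    adj = {i: set() for i in range(len(points))}
    for i, j in prune_long_edges(points, edges, k):
        adj[i].add(j)
        adj[j].add(i)
    return adj

# Two tight pairs joined by one long bridging edge.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
eds = [(0, 1), (2, 3), (1, 2)]  # (1, 2) is the long bridge
print(prune_long_edges(pts, eds))
```

Note that the cutoff adapts to the data's own edge-length distribution, which is exactly what removes the need for a user-chosen distance threshold.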
Criminological research and theory have traditionally focused on individual offenders and macro-level analysis to characterize crime distribution. However, local aspects of crime activity have also been recognized as important factors in crime analysis. An interesting problem is to discover implicit local patterns between crime activities and environmental factors such as nearby facilities and business establishment types. This work presents a micro-level analysis of criminal incidents using spatial association rule mining. We show how to process crime incident points and their spatial relationships with other task-relevant spatial features, and how to discover interesting crime patterns using an association rule mining algorithm. A case study was conducted with real incident records and points of interest in a study area to discover interesting relationship patterns among crimes, their characteristics, and nearby spatial features. This study shows that our approach with spatial association rule mining is promising for micro-level analysis of crime.
{"title":"Micro-Level Incident Analysis using Spatial Association Rule Mining","authors":"J. S. Yoo, Sang Jun Park, Aneesh Raman","doi":"10.1109/ICBK.2019.00049","DOIUrl":"https://doi.org/10.1109/ICBK.2019.00049","url":null,"abstract":"Criminological research and theory have traditionally focused on individual offenders and macro-level analysis to characterize crime distribution. However local aspects of crime activities have been also recognized as important factors in crime analysis. It is an interesting problem to discover implicit local patterns between crime activities and environmental factors such as nearby facilities and business establishment types. This work presents micro-level analysis of criminal incidents using spatial association rule mining. We show how to process crime incident points and their spatial relationships with task-relevant other spatial features, and discover interesting crime patterns using an association rule mining algorithm. A case study was conducted with real incident records and points of interest in a study area to discover interesting relationship patterns among crimes, their characteristics, and nearby spatial features. This study shows that our approach with spatial association rule mining is promising for micro-level analysis of crime.","PeriodicalId":383917,"journal":{"name":"2019 IEEE International Conference on Big Knowledge (ICBK)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127303415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
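A hedged sketch of the pipeline this abstract describes: each incident becomes a transaction of its type plus the POI categories found nearby, and candidate rules are scored by the standard support and confidence measures. The categories and data below are invented for illustration, not taken from the paper's case study.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the spatial association rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Each transaction: an incident type plus POI categories within a buffer
# distance of the incident point (the spatial-join step, done elsewhere).
tx = [frozenset(t) for t in [
    {"theft", "near:bar", "near:atm"},
    {"theft", "near:bar"},
    {"assault", "near:bar"},
    {"theft", "near:atm"},
]]
print(support(tx, {"theft"}))                   # 0.75
print(confidence(tx, {"near:bar"}, {"theft"}))
```

In practice an Apriori-style generator would enumerate the candidate itemsets; the two measures above are what make a discovered rule "interesting" enough to report.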
A co-location pattern is a subset of spatial features whose instances are frequently located together in spatial proximity. However, traditional approaches focus only on the prevalence of patterns and cannot reflect their influence. In this paper, we address the problem of mining high-influence co-location patterns. First, we define the concepts of influence features and reference features; based on these, a series of further definitions describes the influence co-location pattern. Second, a metric is designed to measure the influence degree of an influence co-location pattern, and a basic algorithm for mining high-influence co-location patterns is presented. Then, based on the properties of influence co-location patterns, a pruning strategy is proposed to improve the algorithm's efficiency. Finally, we conduct extensive experiments on synthetic and real data sets. The experimental results show that our approaches discover high-influence co-location patterns effectively and efficiently.
{"title":"Discovering High Influence Co-location Patterns from Spatial Data Sets","authors":"Lili Lei, Lizhen Wang, Yuming Zeng, Lanqing Zeng","doi":"10.1109/ICBK.2019.00026","DOIUrl":"https://doi.org/10.1109/ICBK.2019.00026","url":null,"abstract":"The co-location pattern is a subset of spatial features that are frequently located together in spatial proximity. However, the traditional approaches only focus on the prevalence of patterns, and it cannot reflect the influence of patterns. In this paper, we are committed to address the problem of mining high influence co-location patterns. At first, we define the concepts of influence features and reference features. Based on these concepts, a series of definitions are introduced further to describe the influence co-location pattern. Secondly, a metric is designed to measure the influence degree of the influence co-location pattern, and a basic algorithm for mining high influence co-location patterns is presented. Then, according to the properties of the influence co-location pattern, the corresponding pruning strategy is proposed to improve the efficiency of the algorithm. At last, we conduct extensive experiments on synthetic and real data sets to test our approaches. Experimental results show that our approaches are effective and efficient to discover high influence co-location patterns.","PeriodicalId":383917,"journal":{"name":"2019 IEEE International Conference on Big Knowledge (ICBK)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131899947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
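For context, the "prevalence" that this abstract contrasts with influence is conventionally measured by the participation index: the minimum, over the pattern's features, of the fraction of that feature's instances appearing in some row instance of the pattern. A minimal sketch under those usual definitions (the influence metric itself is the paper's contribution and is not reproduced here):

```python
def participation_index(instance_counts, row_instances, pattern):
    """instance_counts: {feature: total number of instances of that feature}.
    row_instances: list of tuples, one element per feature in `pattern`,
    each tuple a co-occurring instance group found by the neighbor search.
    Returns the classic participation index: min over features of
    (distinct participating instances / total instances of the feature)."""
    ratios = []
    for pos, feat in enumerate(pattern):
        used = {row[pos] for row in row_instances}
        ratios.append(len(used) / instance_counts[feat])
    return min(ratios)

# Pattern (A, B): A has 3 instances, B has 2; two row instances found.
pi = participation_index({"A": 3, "B": 2},
                         [("a1", "b1"), ("a2", "b1")],
                         ("A", "B"))
print(pi)  # min(2/3, 1/2) = 0.5
```

A pattern can score high on this index yet involve only low-impact features, which is the gap the influence-based measure is meant to close.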
M. Jin, Kai Zhang, Yunhaonan Yang, Shuanglian Xie, Kai Song, Yonghua Hu, X. Bao
A prerequisite for making full use of unstructured electronic medical records is protecting patients' information privacy. Identifying and removing information that can be used to identify a patient, prior to processing electronic medical record data, is currently a research hotspot. Very few methods exist for de-identification of Chinese electronic medical records, and their cross-center performance is poor. We therefore develop a de-identification method that combines rule-based methods and machine learning methods. The method was tested on 700 electronic medical records from six hospitals. Five-fold cross-validation was used to evaluate C5.0, Random Forest, SVM, and XGBoost; leave-one-out testing was used to evaluate CRF. The F1 measure of the machine learning methods reached 91.18% on PHI_Names, 98.21% on PHI_MEDICALID, 95.74% on PHI_OTHERNFC, 97.14% on PHI_GEO, 89.19% on PHI_DATES, and 91.49% on PHI_TEL. The F1 measure of the rule-based methods reached 93.00% on PHI_Names, 97.00% on PHI_MEDICALID, 97.00% on PHI_OTHERNFC, 97.00% on PHI_GEO, 96.00% on PHI_DATES, and 89.00% on PHI_TEL.
{"title":"A Hybrid Machine Learning Method for the De-identification of Un-Structured Narrative Clinical Text in Multi-center Chinese Electronic Medical Records Data","authors":"M. Jin, Kai Zhang, Yunhaonan Yang, Shuanglian Xie, Kai Song, Yonghua Hu, X. Bao","doi":"10.1109/ICBK.2019.00023","DOIUrl":"https://doi.org/10.1109/ICBK.2019.00023","url":null,"abstract":"The premise of the full use of unstructured electronic medical records is to maintain the fully protection of a patient's information privacy. Presently, in prior of processing the electronic medical record date, identification and removing of relevant information which can be used to identify a patient is a research hotspot nowadays. There are very few methods in de–identification of Chinese electronic medical records and their cross–center performance is poor. Therefore we develop a de-identification method which is a mixture of rule-based methods and machine learning methods. The method was tested on 700 electronic medical records from six hospitals. Five-fold cross test was used to evaluate the results of c5.0, Random Forest, SVM and XGBOOST. Leave-one-out test was used to evaluate CRF. And the F1 Measure of machine learning reached 91.18% in PHI_Names, 98.21% in PHI_MEDICALID, 95.74% in PHI_OTHERNFC, 97.14% in PHI_GEO, 89.19% in PHI_DATES, and 91.49% in PHI_TEL. And the F1 Measure of rule-based methods reached 93.00% in PHI_Names, 97.00% in PHI_MEDICALID, 97.00% in PHI_OTHERNFC, 97.00% in PHI_GEO, 96.00% in PHI_DATES, and 89.00% in PHI_TEL.","PeriodicalId":383917,"journal":{"name":"2019 IEEE International Conference on Big Knowledge (ICBK)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131918109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
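The rule-based side of such a hybrid can be sketched with regular expressions. The three patterns below (a Chinese-style date, an 11-digit mainland phone number, an "MR"-prefixed 8-digit medical ID) are illustrative assumptions, not the paper's actual rules; a real system would also cover names and locations, which need the machine-learning side.

```python
import re

# Hypothetical PHI patterns, ordered so longer spans match first.
PHI_RULES = [
    ("PHI_DATES", re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")),
    ("PHI_TEL", re.compile(r"(?<!\d)1\d{10}(?!\d)")),
    ("PHI_MEDICALID", re.compile(r"MR\d{8}(?!\d)")),
]

def deidentify(text):
    """Replace each matched PHI span with its category tag."""
    for tag, pattern in PHI_RULES:
        text = pattern.sub(f"[{tag}]", text)
    return text

record = "患者于2019年3月5日入院，病案号MR12345678，联系电话13800138000。"
print(deidentify(record))
```

Lookarounds are used instead of `\b` for the phone number because in Unicode mode Chinese characters count as word characters, so a `\b` between 话 and 1 would never match.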
Label noise plays an important role in classification: it can cause overfitting of learning methods and deteriorate their generalizability. The relative density method is effective for label noise detection but has high time complexity. The multi-granularity relative density method, on the other hand, reduces the time cost but also reduces classification accuracy. In this paper, we propose an improved relative density method, named the relative density method based on space partitioning (SPRD). The proposed method not only accelerates label noise detection but also maintains good classification performance. In addition, the parameter k used in conventional relative density methods is removed, making the proposed method adaptive. Experimental results on UCI datasets demonstrate that the proposed method is more efficient than conventional methods and achieves better classification accuracy than the multi-granularity relative density method.
{"title":"A Fast Relative Density Method Based on Space Partitioning","authors":"Binggui Wang, Shuyin Xia, Hong Yu, Guoyin Wang","doi":"10.1109/ICBK.2019.00041","DOIUrl":"https://doi.org/10.1109/ICBK.2019.00041","url":null,"abstract":"Label noise play an important role in classification. It can cause overfitting of learning methods and deteriorate their generalizability. The relative density method is effective in label noise detection, but it has high time complexity. On the other hand, the multi-granularity relative density method reduces the time cost, but the classification accuracy is also reduced. In this paper, we propose an improved relative density method, named the relative density method based on space partitioning (SPRD). The proposed method not only accelerates the label noise detection but also maintains a good classification performance. Also, the parameter k, which is used in the conventional relative density methods, is removed, making the proposed method adaptive. The experiment results on the UCI datasets demonstrate that the proposed method has higher efficiency than the conventional methods and better classification accuracy than the multi-granularity relative density method.","PeriodicalId":383917,"journal":{"name":"2019 IEEE International Conference on Big Knowledge (ICBK)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132681007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
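As a hedged sketch of the underlying idea (not SPRD's space partitioning, and deliberately keeping the neighborhood parameter the paper removes): a point's label is suspicious when points of its own class are sparse around it relative to other classes, approximated here by a k-nearest-neighbor vote.

```python
import math

def knn_label_noise(points, labels, k=3):
    """Flag indices whose label disagrees with the majority of their
    k nearest neighbors -- a crude stand-in for relative density."""
    noisy = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        votes = [labels[j] for _, j in dists[:k]]
        if votes.count(labels[i]) < len(votes) / 2:
            noisy.append(i)
    return noisy

# Two clean clusters plus one mislabeled point inside the first cluster.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
lbl = [0, 0, 0, 1, 1, 1, 1]  # index 3 sits among label-0 points
print(knn_label_noise(pts, lbl))
```

The all-pairs distance computation here is exactly the quadratic cost that motivates space-partitioning accelerations such as the one the paper proposes.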
Recent work has managed to learn cross-lingual word embeddings (CLWEs) in an unsupervised manner. As a prominent unsupervised model, generative adversarial networks (GANs) have been heavily studied for unsupervised CLWE learning by aligning the embedding spaces of different languages. Because they disturb the embedding distribution, the embeddings of low-frequency words (LFEs) are usually treated as noise in the alignment process. To alleviate the impact of LFEs, existing GAN-based models use a heuristic rule to aggressively sample the embeddings of high-frequency words (HFEs); however, this sampling rule lacks theoretical support. In this paper, we propose a novel GAN-based model that learns cross-lingual word embeddings without any parallel resources. To address the noise caused by LFEs, perturbations are injected into the LFEs to offset the distribution disturbance. In addition, a modified framework based on Cramér GAN is designed to train the perturbed LFEs and the HFEs jointly. Empirical evaluation on bilingual lexicon induction demonstrates that the proposed model outperforms the state-of-the-art GAN-based model on several language pairs.
{"title":"Unsupervised Cross-Lingual Word Embeddings Learning with Adversarial Training","authors":"Yuling Li, Yuhong Zhang, Peipei Li, Xuegang Hu","doi":"10.1109/ICBK.2019.00029","DOIUrl":"https://doi.org/10.1109/ICBK.2019.00029","url":null,"abstract":"Recent works have managed to learn cross-lingual word embeddings (CLWEs) in an unsupervised manner. As a prominent unsupervised model, generative adversarial networks (GANs) have been heavily studied for unsupervised CLWEs learning by aligning the embedding spaces of different languages. Due to disturbing the embedding distribution, the embeddings of low-frequency words (LFEs) are usually treated as noises in the alignment process. To alleviate the impact of LFEs, existing GANs based models utilized a heuristic rule to aggressively sample the embeddings of high-frequency words (HFEs). However, such sampling rule lacks of theoretical support. In this paper, we propose a novel GANs based model to learn cross-lingual word embeddings without any parallel resource. To address the noise problem caused by the LFEs, some perturbations are injected into the LFEs for offsetting the distribution disturbance. In addition, a modified framework based on Cramér GAN is designed to train the perturbed LFEs and the HFEs jointly. Empirical evaluation on bilingual lexicon induction demonstrates that the proposed model outperforms the state-of-the-art GANs based model in several language pairs.","PeriodicalId":383917,"journal":{"name":"2019 IEEE International Conference on Big Knowledge (ICBK)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129014511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
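The perturbation-injection idea can be sketched independently of the GAN itself: add zero-mean noise to the embeddings of words below a frequency cutoff before feeding them to the discriminator. The noise scale, frequency threshold, and toy vocabulary below are assumptions for illustration; the paper's Cramér GAN training loop is not reproduced.

```python
import random

def perturb_lfes(embeddings, freqs, min_freq=100, sigma=0.01, seed=0):
    """Add zero-mean Gaussian noise to embeddings of words whose corpus
    frequency is below min_freq; high-frequency embeddings pass through."""
    rng = random.Random(seed)
    out = {}
    for word, vec in embeddings.items():
        if freqs.get(word, 0) < min_freq:
            out[word] = [x + rng.gauss(0.0, sigma) for x in vec]
        else:
            out[word] = list(vec)
    return out

emb = {"the": [1.0, 0.0], "zyzzyva": [0.0, 1.0]}
frq = {"the": 50000, "zyzzyva": 3}
new = perturb_lfes(emb, frq)
print(new["the"])   # unchanged: [1.0, 0.0]
```

Intuitively, the noise smooths the sparse region of the distribution occupied by low-frequency words, so the adversarial alignment no longer has to treat them as outliers or discard them by sampling only high-frequency words.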
Tsung-Yu Hsieh, Yiwei Sun, Suhang Wang, Vasant G Honavar
With the advent of big data, there is an urgent need for methods and tools for integrative analyses of multi-modal or multi-view data. Of particular interest are unsupervised methods for parsimonious selection of non-redundant, complementary, and information-rich features from multi-view data. We introduce the Adaptive Structural Co-Regularization Algorithm (ASCRA) for unsupervised multi-view feature selection. ASCRA jointly optimizes the embeddings of the different views so as to maximize their agreement with a consensus embedding, which aims to simultaneously recover the latent cluster structure in the multi-view data while accounting for correlations between views. ASCRA uses the consensus embedding to guide efficient selection of features that preserve the latent cluster structure of the multi-view data. We establish ASCRA's convergence properties and analyze its computational complexity. The results of our experiments using several real-world and synthetic data sets suggest that ASCRA outperforms or is competitive with state-of-the-art unsupervised multi-view feature selection methods.
{"title":"Adaptive Structural Co-regularization for Unsupervised Multi-view Feature Selection","authors":"Tsung-Yu Hsieh, Yiwei Sun, Suhang Wang, Vasant G Honavar","doi":"10.1109/ICBK.2019.00020","DOIUrl":"https://doi.org/10.1109/ICBK.2019.00020","url":null,"abstract":"With the advent of big data, there is an urgent need for methods and tools for integrative analyses of multi-modal or multi-view data. Of particular interest are unsupervised methods for parsimonious selection of non-redundant, complementary, and information-rich features from multi-view data. We introduce Adaptive Structural Co-Regularization Algorithm (ASCRA) for unsupervised multi-view feature selection. ASCRA jointly optimizes the embeddings of the different views so as to maximize their agreement with a consensus embedding which aims to simultaneously recover the latent cluster structure in the multi-view data while accounting for correlations between views. ASCRA uses the consensus embedding to guide efficient selection of features that preserve the latent cluster structure of the multi-view data. We establish ASCRA's convergence properties and analyze its computational complexity. The results of our experiments using several real-world and synthetic data sets suggest that ASCRA outperforms or is competitive with state-of-the-art unsupervised multi-view feature selection methods.","PeriodicalId":383917,"journal":{"name":"2019 IEEE International Conference on Big Knowledge (ICBK)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115325326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the advent of the era of big data, information from multiple sources often conflicts because errors and fake information are inevitable. How to obtain the most trustworthy or true information (i.e., the truth) has therefore gradually become a troublesome problem. To meet this challenge, truth discovery, a technique that can infer the truth and estimate source reliability without supervision, has attracted more and more attention. However, most existing truth discovery methods only consider whether pieces of information are the same or different, rather than the fine-grained relations between them, such as inclusion, support, and mutual exclusion, even though such relations frequently exist in real-world applications. To tackle this issue, we propose a novel truth discovery method named OTDCR, which handles the fine-grained relations between claims and infers the truth more effectively by modeling those relations. In addition, a novel method for handling abnormal values, specially designed for categorical data with such relations, is applied in the preprocessing step of truth discovery. Experiments on a real dataset show that our method is more effective than several outstanding methods.
{"title":"An Optimization-Based Truth Discovery Method with Claim Relation","authors":"Jiazhu Xia, Ying He, Yuxin Jin, Xianyu Bao, Gongqing Wu","doi":"10.1109/ICBK.2019.00046","DOIUrl":"https://doi.org/10.1109/ICBK.2019.00046","url":null,"abstract":"With the advent of the era of big data, the information from multi-sources often conflicts due to that errors and fake information are inevitable. Therefore, how to obtain the most trustworthy or true information (i.e. truth) people need gradually becomes a troublesome problem. In order to meet this challenge, a novel hot technology named truth discovery that can infer the truth and estimate the reliability of the source without supervision has attracted more and more attention. However, most existing truth discovery methods only consider that the information is either same or different rather than the fine-grained relation between them, such as inclusion, support, mutual exclusion, etc. Actually, this situation frequently exists in real-world applications. To tackle the aforementioned issue, we propose a novel truth discovery method named OTDCR in this paper, which can handle the fine-grained relation between the information and infer the truth more effectively through modeling the relation. In addition, a novel method of processing abnormal values is applied to the preprocessing of truth discovery, which is specially designed for categorical data with the relation. Experiments in real dataset show our method is more effective than several outstanding methods.","PeriodicalId":383917,"journal":{"name":"2019 IEEE International Conference on Big Knowledge (ICBK)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131000696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
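The general optimization-based truth discovery loop that this line of work builds on (alternating truth estimation and source-weight updates, in the style of CRH-like methods — not OTDCR's relation-aware objective) can be sketched as:

```python
import math
from collections import Counter, defaultdict

def truth_discovery(claims, iters=10):
    """claims: {source: {object: claimed value}}. Alternates weighted
    voting for per-object truths with log-based source weight updates."""
    weights = {s: 1.0 for s in claims}
    truths = {}
    for _ in range(iters):
        # 1) Estimate truths by weighted vote per object.
        votes = defaultdict(Counter)
        for s, obs in claims.items():
            for obj, val in obs.items():
                votes[obj][val] += weights[s]
        truths = {obj: c.most_common(1)[0][0] for obj, c in votes.items()}
        # 2) Update each source's weight from its (smoothed) error rate.
        for s, obs in claims.items():
            errs = sum(truths[o] != v for o, v in obs.items())
            rate = (errs + 0.5) / (len(obs) + 1.0)
            weights[s] = -math.log(rate)
    return truths, weights

claims = {
    "s1": {"capital:FR": "Paris", "capital:DE": "Berlin"},
    "s2": {"capital:FR": "Paris", "capital:DE": "Berlin"},
    "s3": {"capital:FR": "Lyon",  "capital:DE": "Berlin"},
}
truths, weights = truth_discovery(claims)
print(truths["capital:FR"])   # Paris
```

OTDCR's contribution sits in step 1: instead of treating "Lyon" and "Paris" as merely different, a relation-aware vote would let supporting or including claims reinforce each other.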
User identity linkage (UIL) refers to linking accounts of the same user across different online social platforms. State-of-the-art UIL methods usually perform account matching using account features derived from profile attributes, content, and relationships. They are, however, static and do not adapt well to fast-changing online social data due to (a) new content and activities generated by users, as well as (b) new platform functions introduced to users. In particular, the importance of features used in UIL methods may change over time, and new important user features may be introduced. In this paper, we propose AD-Link, a new UIL method that (i) learns and assigns weights to the user features used for user identity linkage and (ii) handles new user features introduced by new user-generated data. We evaluated AD-Link on real-world datasets from three popular online social platforms, namely Twitter, Facebook, and Foursquare. The results show that AD-Link outperforms the state-of-the-art UIL methods.
{"title":"AD-Link: An Adaptive Approach for User Identity Linkage","authors":"Xin Mu, Wei Xie, R. Lee, Feida Zhu, Ee-Peng Lim","doi":"10.1109/ICBK.2019.00032","DOIUrl":"https://doi.org/10.1109/ICBK.2019.00032","url":null,"abstract":"User identity linkage (UIL) refers to linking accounts of the same user across different online social platforms. The state-of-the-art UIL methods usually perform account matching using user account's features derived from the profile attributes, content and relationships. They are however static and do not adapt well to fast-changing online social data due to: (a) new content and activities generated by users; as well as (b) new platform functions introduced to users. In particular, the importance of features used in UIL methods may change over time and new important user features may be introduced. In this paper, we proposed AD-Link, a new UIL method which (i) learns and assigns weights to the user features used for user identity linkage and (ii) handles new user features introduced by new user-generated data. We evaluated AD-Link on real-world datasets from three popular online social platforms, namely, Twitter, Facebook and Foursquare. The results show that AD-Link outperforms the state-of-the-art UIL methods.","PeriodicalId":383917,"journal":{"name":"2019 IEEE International Conference on Big Knowledge (ICBK)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124184146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
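Weighted-feature account matching with learned, extensible weights can be sketched with a perceptron-style update. The feature names, update rule, and training pairs below are illustrative assumptions, not AD-Link's actual learner; the point is that per-feature weights are learned from labeled match pairs and that a feature unseen at initialization simply enters with weight zero.

```python
def score(weights, feats):
    """Weighted similarity score for a candidate account pair."""
    return sum(weights[f] * v for f, v in feats.items())

def train_weights(pairs, features, lr=0.1, epochs=20):
    """pairs: list of (feature_dict, label), label 1 for a true account
    match, 0 otherwise. Learns per-feature weights with a simple
    perceptron update; newly introduced features start at weight 0."""
    weights = {f: 0.0 for f in features}
    for _ in range(epochs):
        for feats, label in pairs:
            for f in feats:              # admit newly introduced features
                weights.setdefault(f, 0.0)
            pred = 1 if score(weights, feats) > 0.5 else 0
            for f, v in feats.items():
                weights[f] += lr * (label - pred) * v
    return weights

train = [
    ({"name_sim": 0.9, "loc_sim": 0.8}, 1),
    ({"name_sim": 0.2, "loc_sim": 0.1}, 0),
    ({"name_sim": 0.8, "loc_sim": 0.9}, 1),
]
w = train_weights(train, ["name_sim", "loc_sim"])
print(score(w, {"name_sim": 0.9, "loc_sim": 0.9}) > 0.5)  # True
```

The `setdefault` line is what makes the scheme adaptive in the abstract's sense: a feature that first appears in later data gets a weight slot and is then tuned like any other.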