2019 IEEE International Conference on Big Knowledge (ICBK): Latest Papers

A Behavior Sequence Clustering-Based Enterprise Network Anomaly Host Recognition Method
Pub Date : 2019-11-01 DOI: 10.1109/ICBK.2019.00039
Jing Tao, Ning Zheng, Waner Wang, Ting Han, Xuna Zhan, Qingxin Luan
Abnormal host detection is a critical issue in an enterprise intranet data center. Traditional anomalous-host detection methods focus mainly on detecting individual anomalous behaviors, and judging abnormality from a single behavior point has limitations: for example, the entire attack process cannot be fully reconstructed, and many anomalies go unreported. In this paper, we therefore propose a behavior sequence clustering-based method for detecting anomalous hosts in an enterprise network. We use the Toeplitz Inverse Covariance-Based Clustering (TICC) algorithm [1] to segment and cluster time series data, mine anomalous host behavior sequences, and identify the anomalous hosts of the enterprise network. The experimental results show that the method can quickly identify anomalous hosts and accurately reconstruct the complete attack process.
Citations: 1
A Spatial Co-location Pattern Mining Algorithm Without Distance Thresholds
Pub Date : 2019-11-01 DOI: 10.1109/ICBK.2019.00040
Vanha Tran, Lizhen Wang, Hongmei Chen
Spatial co-location pattern mining is the process of finding groups of distinct spatial features whose instances frequently appear in close proximity to each other. The proximity of instances is often defined by the distance between them: if the distance is smaller than a user-specified threshold, the instances have a neighbor relationship. Under this definition, however, proximity depends heavily on the distance threshold, the heterogeneity of the distribution density of spatial datasets is neglected, and it is hard for users to choose a suitable threshold value. In this paper, we propose a statistical method that determines the neighbor relationships of instances in space without requiring users to supply distance threshold parameters. First, the proximity of instances is roughly materialized by employing Delaunay triangulation. Then, according to the statistical information of the vertices and edges in the Delaunay triangulation, we design three strategies to constrain the Delaunay triangulation. The neighbor relationships of instances are extracted automatically and accurately from the constrained Delaunay triangulation without requiring users to specify distance thresholds. After that, we propose a k-order neighbor notion to obtain the neighborhoods of instances for mining co-location patterns. Finally, we develop a constrained Delaunay triangulation-based k-order neighborhood co-location pattern mining algorithm called CDT-kN-CP. Tests on both synthetic datasets and real point-of-interest datasets of Beijing and Guangzhou, China, indicate that our method improves both accuracy and scalability compared with previous methods.
Citations: 2
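The k-order neighbor notion mentioned in the co-location abstract above can be illustrated with a minimal sketch. It assumes the constrained Delaunay triangulation has already been reduced to an undirected edge list, and it assumes that a k-order neighbor is any instance reachable within k edges; the abstract does not give the exact definition, so both points are assumptions. The function name `k_order_neighbors` and the toy edge list are hypothetical.

```python
from collections import deque

def k_order_neighbors(edges, k):
    """Map each node to the set of nodes reachable within k edges (itself excluded)."""
    # Build an undirected adjacency map from the edge list.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    neighborhoods = {}
    for start in adjacency:
        seen = {start}
        queue = deque([(start, 0)])
        # Breadth-first search, cut off once depth k is reached.
        while queue:
            node, depth = queue.popleft()
            if depth == k:
                continue
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
        neighborhoods[start] = seen - {start}
    return neighborhoods

# Hypothetical edges kept after constraining a triangulation;
# instance names are "<feature><id>".
edges = [("A1", "B1"), ("B1", "C1"), ("C1", "A2")]
first_order = k_order_neighbors(edges, 1)
second_order = k_order_neighbors(edges, 2)
```

With k = 1 the neighborhood of an instance is just its direct triangulation neighbors; larger k widens the neighborhood without introducing any distance threshold.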
Micro-Level Incident Analysis using Spatial Association Rule Mining
Pub Date : 2019-11-01 DOI: 10.1109/ICBK.2019.00049
J. S. Yoo, Sang Jun Park, Aneesh Raman
Criminological research and theory have traditionally focused on individual offenders and macro-level analysis to characterize crime distribution. However, local aspects of crime activities have also been recognized as important factors in crime analysis. It is an interesting problem to discover implicit local patterns between crime activities and environmental factors such as nearby facilities and business establishment types. This work presents a micro-level analysis of criminal incidents using spatial association rule mining. We show how to process crime incident points and their spatial relationships with other task-relevant spatial features, and how to discover interesting crime patterns using an association rule mining algorithm. A case study was conducted with real incident records and points of interest in a study area to discover interesting relationship patterns among crimes, their characteristics, and nearby spatial features. This study shows that our approach with spatial association rule mining is promising for micro-level analysis of crime.
Citations: 2
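As a rough illustration of the association rule step described in the incident-analysis abstract above, the sketch below computes the standard support and confidence measures for one rule. It assumes each incident has already been turned into a transaction, that is, a set of items combining the crime type, its characteristics, and nearby spatial features. The item names and the four transactions are invented for illustration and are not taken from the paper.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base > 0 else 0.0

# Hypothetical transactions: one set of items per incident.
transactions = [
    {"near_bar", "night", "assault"},
    {"near_bar", "night", "assault"},
    {"near_bar", "day", "theft"},
    {"near_school", "day", "theft"},
]

# Rule under test: {near_bar} -> {assault}
rule_support = support(transactions, {"near_bar", "assault"})
rule_confidence = confidence(transactions, {"near_bar"}, {"assault"})
```

A full miner would enumerate candidate itemsets (for example with Apriori) and keep only rules above chosen support and confidence thresholds; this sketch only scores a single rule.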
Discovering High Influence Co-location Patterns from Spatial Data Sets
Pub Date : 2019-11-01 DOI: 10.1109/ICBK.2019.00026
Lili Lei, Lizhen Wang, Yuming Zeng, Lanqing Zeng
A co-location pattern is a subset of spatial features whose instances are frequently located together in spatial proximity. However, traditional approaches focus only on the prevalence of patterns and cannot reflect their influence. In this paper, we address the problem of mining high influence co-location patterns. First, we define the concepts of influence features and reference features. Based on these concepts, a series of further definitions is introduced to describe the influence co-location pattern. Second, a metric is designed to measure the influence degree of an influence co-location pattern, and a basic algorithm for mining high influence co-location patterns is presented. Then, according to the properties of the influence co-location pattern, a corresponding pruning strategy is proposed to improve the efficiency of the algorithm. Finally, we conduct extensive experiments on synthetic and real data sets to test our approaches. Experimental results show that our approaches are effective and efficient at discovering high influence co-location patterns.
Citations: 1
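The "prevalence" that the high-influence abstract above contrasts itself with is commonly measured by the participation index. The sketch below computes it for a size-2 pattern. It is not the influence metric of the paper, which the abstract does not define. The function name, the instance naming, and the neighbor pairs are hypothetical, and the code assumes the two features are distinct and each has at least one instance.

```python
def participation_index(instances, neighbor_pairs, feature_a, feature_b):
    """Participation index of the size-2 pattern {feature_a, feature_b}.

    instances: {instance_id: feature}
    neighbor_pairs: iterable of (instance_id, instance_id) neighbor relations
    """
    participating = {feature_a: set(), feature_b: set()}
    for u, v in neighbor_pairs:
        fu, fv = instances[u], instances[v]
        # Keep only pairs that join one instance of each feature.
        if {fu, fv} == {feature_a, feature_b}:
            participating[fu].add(u)
            participating[fv].add(v)

    ratios = []
    for feature in (feature_a, feature_b):
        total = sum(1 for f in instances.values() if f == feature)
        # Participation ratio: share of this feature's instances in the pattern.
        ratios.append(len(participating[feature]) / total)
    # The index is the smallest participation ratio among the features.
    return min(ratios)

instances = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "B3": "B", "C1": "C"}
neighbor_pairs = [("A1", "B1"), ("A2", "B1"), ("A1", "C1")]
pi_ab = participation_index(instances, neighbor_pairs, "A", "B")
```

Here both A instances take part in the pattern but only one of three B instances does, so the index is limited by B.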
A Hybrid Machine Learning Method for the De-identification of Un-Structured Narrative Clinical Text in Multi-center Chinese Electronic Medical Records Data
Pub Date : 2019-11-01 DOI: 10.1109/ICBK.2019.00023
M. Jin, Kai Zhang, Yunhaonan Yang, Shuanglian Xie, Kai Song, Yonghua Hu, X. Bao
Making full use of unstructured electronic medical records presupposes full protection of patients' information privacy. Identifying and removing information that could be used to identify a patient, before the electronic medical record data are processed, is currently a research hotspot. There are very few methods for the de-identification of Chinese electronic medical records, and their cross-center performance is poor. We therefore develop a de-identification method that combines rule-based methods and machine learning methods. The method was tested on 700 electronic medical records from six hospitals. A five-fold cross test was used to evaluate the results of C5.0, Random Forest, SVM and XGBoost. A leave-one-out test was used to evaluate CRF. The F1 measure of the machine learning methods reached 91.18% in PHI_Names, 98.21% in PHI_MEDICALID, 95.74% in PHI_OTHERNFC, 97.14% in PHI_GEO, 89.19% in PHI_DATES, and 91.49% in PHI_TEL. The F1 measure of the rule-based methods reached 93.00% in PHI_Names, 97.00% in PHI_MEDICALID, 97.00% in PHI_OTHERNFC, 97.00% in PHI_GEO, 96.00% in PHI_DATES, and 89.00% in PHI_TEL.
Citations: 0
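The rule-based half of the hybrid de-identification method above might look roughly like the following sketch. The abstract does not list the actual rules, so the two regular expressions here (a date pattern and an 11-digit mainland mobile number pattern) are assumptions chosen only for illustration. The tag names PHI_DATES and PHI_TEL come from the abstract; the function name `deidentify` and the sample sentence are hypothetical.

```python
import re

# Hypothetical rules; a real system would need far broader coverage
# (names, medical IDs, locations) and careful validation.
RULES = [
    # Dates such as 2019-03-05 or 2019年3月5日.
    ("PHI_DATES", re.compile(r"\d{4}-\d{1,2}-\d{1,2}|\d{4}年\d{1,2}月\d{1,2}日")),
    # 11-digit mobile numbers starting with 1, not embedded in longer digit runs.
    ("PHI_TEL", re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")),
]

def deidentify(text):
    """Replace every rule match with its bracketed PHI tag."""
    for tag, pattern in RULES:
        text = pattern.sub(f"[{tag}]", text)
    return text

# Invented sentence: "The patient was admitted on 5 March 2019, contact phone 13812345678."
sample = "患者于2019年3月5日入院，联系电话13812345678。"
masked = deidentify(sample)
```

Rules of this kind are cheap and precise for well-formatted fields such as dates, which is consistent with the rule-based F1 scores reported above, while free-form fields tend to need the machine learning component.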
A Fast Relative Density Method Based on Space Partitioning
Pub Date : 2019-11-01 DOI: 10.1109/ICBK.2019.00041
Binggui Wang, Shuyin Xia, Hong Yu, Guoyin Wang
Label noise plays an important role in classification. It can cause learning methods to overfit and degrade their generalizability. The relative density method is effective in label noise detection, but it has high time complexity. The multi-granularity relative density method, on the other hand, reduces the time cost, but it also reduces classification accuracy. In this paper, we propose an improved relative density method, named the relative density method based on space partitioning (SPRD). The proposed method not only accelerates label noise detection but also maintains good classification performance. In addition, the parameter k used in conventional relative density methods is removed, making the proposed method adaptive. Experimental results on UCI datasets demonstrate that the proposed method is more efficient than the conventional methods and achieves better classification accuracy than the multi-granularity relative density method.
Citations: 1
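For context on the conventional, k-based relative density idea that SPRD above improves on, here is a minimal sketch. The abstract does not give the formula, so this assumes one plausible formulation: the ratio of a sample's mean distance to its k nearest same-label neighbors over its mean distance to its k nearest different-label neighbors, with a ratio above 1 flagged as suspected label noise. It is not SPRD itself, which partitions the space and removes k. Function names and toy data are hypothetical, and the code assumes every sample has at least one neighbor of each kind at nonzero distance.

```python
import math

def relative_density(points, labels, index, k):
    """Mean k-NN distance to same-label points over mean k-NN distance to other-label points."""
    same, other = [], []
    for j, (point, label) in enumerate(zip(points, labels)):
        if j == index:
            continue
        distance = math.dist(points[index], point)
        (same if label == labels[index] else other).append(distance)
    same_k = sorted(same)[:k]
    other_k = sorted(other)[:k]
    return (sum(same_k) / len(same_k)) / (sum(other_k) / len(other_k))

def suspected_label_noise(points, labels, k=2):
    """Indices whose own class is farther away than the other class."""
    return [i for i in range(len(points))
            if relative_density(points, labels, i, k) > 1.0]

# Two clusters; the last point sits in the class-0 cluster but carries label 1.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.5, 0.5)]
labels = [0, 0, 0, 1, 1, 1, 1]
suspected = suspected_label_noise(points, labels, k=2)
```

This brute-force version compares every pair of points, which reflects the high time complexity that the abstract attributes to the conventional method.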
Unsupervised Cross-Lingual Word Embeddings Learning with Adversarial Training
Pub Date : 2019-11-01 DOI: 10.1109/ICBK.2019.00029
Yuling Li, Yuhong Zhang, Peipei Li, Xuegang Hu
Recent works have managed to learn cross-lingual word embeddings (CLWEs) in an unsupervised manner. As a prominent unsupervised model, generative adversarial networks (GANs) have been heavily studied for unsupervised CLWE learning by aligning the embedding spaces of different languages. Because they disturb the embedding distribution, the embeddings of low-frequency words (LFEs) are usually treated as noise in the alignment process. To alleviate the impact of LFEs, existing GAN-based models use a heuristic rule to aggressively sample the embeddings of high-frequency words (HFEs). However, such a sampling rule lacks theoretical support. In this paper, we propose a novel GAN-based model to learn cross-lingual word embeddings without any parallel resource. To address the noise problem caused by the LFEs, perturbations are injected into the LFEs to offset the distribution disturbance. In addition, a modified framework based on Cramér GAN is designed to train the perturbed LFEs and the HFEs jointly. Empirical evaluation on bilingual lexicon induction demonstrates that the proposed model outperforms the state-of-the-art GAN-based model on several language pairs.
Citations: 1
Adaptive Structural Co-regularization for Unsupervised Multi-view Feature Selection
Pub Date : 2019-11-01 DOI: 10.1109/ICBK.2019.00020
Tsung-Yu Hsieh, Yiwei Sun, Suhang Wang, Vasant G Honavar
With the advent of big data, there is an urgent need for methods and tools for integrative analyses of multi-modal or multi-view data. Of particular interest are unsupervised methods for parsimonious selection of non-redundant, complementary, and information-rich features from multi-view data. We introduce Adaptive Structural Co-Regularization Algorithm (ASCRA) for unsupervised multi-view feature selection. ASCRA jointly optimizes the embeddings of the different views so as to maximize their agreement with a consensus embedding which aims to simultaneously recover the latent cluster structure in the multi-view data while accounting for correlations between views. ASCRA uses the consensus embedding to guide efficient selection of features that preserve the latent cluster structure of the multi-view data. We establish ASCRA's convergence properties and analyze its computational complexity. The results of our experiments using several real-world and synthetic data sets suggest that ASCRA outperforms or is competitive with state-of-the-art unsupervised multi-view feature selection methods.
Citations: 6
An Optimization-Based Truth Discovery Method with Claim Relation
Pub Date : 2019-11-01 DOI: 10.1109/ICBK.2019.00046
Jiazhu Xia, Ying He, Yuxin Jin, Xianyu Bao, Gongqing Wu
With the advent of the big data era, information from multiple sources often conflicts, because errors and fake information are inevitable. How to obtain the most trustworthy or true information (i.e., the truth) that people need has therefore become a troublesome problem. To meet this challenge, truth discovery, a technique that can infer the truth and estimate the reliability of sources without supervision, has attracted more and more attention. However, most existing truth discovery methods consider only whether pieces of information are the same or different, rather than the fine-grained relations between them, such as inclusion, support, and mutual exclusion. Such relations occur frequently in real-world applications. To tackle this issue, we propose a novel truth discovery method named OTDCR, which handles the fine-grained relations between pieces of information and infers the truth more effectively by modeling these relations. In addition, a novel method of processing abnormal values, specially designed for categorical data with such relations, is applied in the preprocessing step of truth discovery. Experiments on real data show that our method is more effective than several outstanding methods.
随着大数据时代的到来,多来源的信息往往会发生冲突,错误和虚假信息在所难免。因此,如何获得人们需要的最可信或最真实的信息(即真相)逐渐成为一个棘手的问题。为了应对这一挑战,一种能够在不受监督的情况下对消息源的真实性进行推断和可靠性估计的新技术——真相发现技术越来越受到人们的关注。然而,现有的大多数真值发现方法只考虑信息的相同或不同,而不考虑它们之间的细粒度关系,如包含、支持、互斥等。实际上,这种情况在实际应用程序中经常存在。为了解决上述问题,本文提出了一种新的真值发现方法OTDCR,该方法可以处理信息之间的细粒度关系,并通过对关系的建模更有效地推断出真值。此外,将一种新的异常值处理方法应用于真值发现的预处理中,该方法是专门针对具有关系的范畴数据而设计的。在实际数据集上的实验表明,该方法比几种已有的方法更有效。
Citations: 3
AD-Link: An Adaptive Approach for User Identity Linkage
Pub Date : 2019-11-01 DOI: 10.1109/ICBK.2019.00032
Xin Mu, Wei Xie, R. Lee, Feida Zhu, Ee-Peng Lim
User identity linkage (UIL) refers to linking accounts of the same user across different online social platforms. State-of-the-art UIL methods usually match accounts using features derived from profile attributes, content, and relationships. They are, however, static and do not adapt well to fast-changing online social data due to (a) new content and activities generated by users and (b) new platform functions introduced to users. In particular, the importance of the features used in UIL methods may change over time, and new important user features may be introduced. In this paper, we propose AD-Link, a new UIL method which (i) learns and assigns weights to the user features used for user identity linkage and (ii) handles new user features introduced by new user-generated data. We evaluate AD-Link on real-world datasets from three popular online social platforms, namely Twitter, Facebook, and Foursquare. The results show that AD-Link outperforms state-of-the-art UIL methods.
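As a rough illustration of weighting user features when matching accounts, the sketch below scores a pair of accounts by a weighted average of per-feature similarities. It is not the AD-Link algorithm: the weights are supplied by hand rather than learned, Jaccard similarity over token sets is an assumed choice, and the names `jaccard` and `link_score` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_score(acct_a, acct_b, weights):
    """Weighted average similarity over features present in both accounts.

    acct_a, acct_b: dicts mapping feature name -> iterable of tokens.
    weights: dict mapping feature name -> non-negative importance.
    Features missing from either account are skipped, so a newly
    introduced feature can be added to `weights` without breaking
    older accounts that lack it.
    """
    shared = [f for f in weights if f in acct_a and f in acct_b]
    total = sum(weights[f] for f in shared)
    if total == 0:
        return 0.0
    weighted = sum(weights[f] * jaccard(acct_a[f], acct_b[f]) for f in shared)
    return weighted / total
```

Skipping features that one account lacks is one simple way to tolerate new features appearing over time; the abstract does not say how AD-Link itself handles them or how it learns the weights.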
Citations: 5