For large-scale multimedia datasets and web content, nearest neighbor search methods based on hashing have attracted considerable attention for cross-modal retrieval due to their fast query speed and low storage cost. Most existing hashing methods map different modalities to a Hamming embedding in a supervised way, where the semantic information comes from a large manual label matrix and each sample in each modality is encoded by a sparse label vector. However, previous studies did not address the challenges of semantic correlation learning and could not make the best use of the prior semantic information. As a result, they fail to preserve accurate semantic similarities and often degrade the performance of hash function learning. To fill this gap, we propose a novel Deep Semantic Correlation learning based Hashing framework (DSCH) that generates unified hash codes in an end-to-end deep learning architecture for the cross-modal retrieval task. The major contribution of this work is to automatically and effectively construct the semantic correlations among data representations and to demonstrate how to utilize this correlation information to generate hash codes for new samples. In particular, DSCH integrates a latent semantic embedding with a unified hash embedding to strengthen the similarity information among multiple modalities. Furthermore, an additional graph regularization is employed in our framework to capture inter-modal and intra-modal correspondences. Our model simultaneously learns the semantic correlations and the unified hash codes, which enhances the effectiveness of the cross-modal retrieval task. Experimental results on two large datasets show that our proposed approach achieves superior accuracy to several state-of-the-art cross-modal methods.
{"title":"Deep Semantic Correlation Learning Based Hashing for Multimedia Cross-Modal Retrieval","authors":"Xiaolong Gong, Linpeng Huang, Fuwei Wang","doi":"10.1109/ICDM.2018.00027","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00027","url":null,"abstract":"For many large-scale multimedia datasets and web contents, the nearest neighbor search methods based on the hashing strategy for cross-modal retrieval have attracted considerable attention due to its fast query speed and low storage cost. Most existing hashing methods try to map different modalities to Hamming embedding in a supervised way where the semantic information comes from a large manual label matrix and each sample in different modalities is usually encoded by a sparse label vector. However, previous studies didn't address the semantic correlation learning challenges and couldn't make the best use of the prior semantic information. Therefore, they cannot preserve the accurate semantic similarities and often degrade the performance of hashing function learning. To fill this gap, we firstly proposed a novel Deep Semantic Correlation learning based Hashing framework (DSCH) that generates unified hash codes in an end-to-end deep learning architecture for cross-modal retrieval task. The major contribution in this work is to effectively automatically construct the semantic correlation between data representation and demonstrate how to utilize correlation information to generate hash codes for new samples. In particular, DSCH integrates latent semantic embedding with a unified hash embedding to strengthen the similarity information among multiple modalities. Furthermore, additional graph regularization is employed in our framework, to capture the correspondences from the inter-modal and intra-modal. Our model simultaneously learns the semantic correlation and the unified hash codes, which enhances the effectiveness of cross-modal retrieval task. Experimental results show the superior accuracy of our proposed approach to several state-of-the-art cross-modality methods on two large datasets.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121938648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In brain network discovery, researchers are interested in discovering brain regions (nodes) and functional connections (edges) between these regions from fMRI scans of the human brain. Some recent works propose coherent models that address both of these sub-tasks. However, these approaches either suffer from mathematical inconsistency or fail to distinguish direct from indirect connections between the nodes. In this paper, we study the problem of collectively discovering coherent brain regions and the direct connections between them. Each node of the brain network represents a brain region, i.e., a set of voxels in fMRI with coherent activities. Each edge denotes a direct dependency between two nodes. The discovered brain network represents a Gaussian graphical model that encodes conditional independence between the activities of different brain regions. We propose a novel model, called CGLasso, which combines Graphical Lasso (GLasso) and orthogonal non-negative matrix tri-factorization (ONMtF) to perform node discovery and edge detection simultaneously. We perform experiments on synthetic datasets with ground truth. The results show that the proposed method outperforms the compared baselines in terms of four quantitative metrics. We also apply the proposed method and the baselines to the real ADHD-200 fMRI dataset. The results demonstrate that our method produces more meaningful networks than the baseline methods.
{"title":"Coherent Graphical Lasso for Brain Network Discovery","authors":"Hang Yin, Xiangnan Kong, Xinyue Liu","doi":"10.1109/ICDM.2018.00191","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00191","url":null,"abstract":"In brain network discovery, researchers are interested in discovering brain regions (nodes) and functional connections (edges) between these regions from fMRI scan of human brain. Some recent works propose coherent models to address both of these sub-tasks. However, these approaches either suffer from mathematical inconsistency or fail to distinguish direct connections and indirect connections between the nodes. In this paper, we study the problem of collective discovery of coherent brain regions and direct connections between these regions. Each node of the brain network represents a brain region, i.e., a set of voxels in fMRI with coherent activities. Each edge denotes a direct dependency between two nodes. The discovered brain network represents a Gaussian graphical model that encodes conditional independence between the activities of different brain regions. We propose a novel model, called CGLasso, which combines Graphical Lasso (GLasso) and orthogonal non-negative matrix tri-factorization (ONMtF), to perform nodes discovery and edge detection simultaneously. We perform experiments on synthetic datasets with ground-truth. The results show that the proposed method performs better than the compared baselines in terms of four quantitative metrics. Besides, we also apply the proposed method and other baselines on the real ADHD-200 fMRI dataset. The results demonstrate that our method produces more meaningful networks comparing with other baseline methods.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121962246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The boom of information technology enables social platforms (like Twitter) to disseminate social content (like news) at an unprecedented rate, which makes early-stage prediction of social content popularity of great practical significance. However, most existing studies assume a long observation period before prediction and suffer from limited precision in early-stage prediction due to insufficient observations. In this paper, we take a fresh perspective and propose a novel early pattern aware Bayesian model. The early pattern representation, the early time series normalized by future popularity, addresses what we call the early-stage indistinctiveness challenge. We then use an expressive evolving function to fit the time series and estimate three interpretable coefficients characterizing the temporal effect of the observed series on future evolution. Furthermore, a Bayesian network is leveraged to model the probabilistic relations among features, early indicators, and early patterns. Experiments on three real-world social platforms (Twitter, Weibo, and WeChat) show that, under different evaluation metrics, our model outperforms other methods in early-stage prediction and exhibits low sensitivity to the observation time.
{"title":"EPAB: Early Pattern Aware Bayesian Model for Social Content Popularity Prediction","authors":"Qitian Wu, Chaoqi Yang, Xiaofeng Gao, Peng He, Guihai Chen","doi":"10.1109/ICDM.2018.00175","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00175","url":null,"abstract":"The boom of information technology enables social platforms (like Twitter) to disseminate social content (like news) in an unprecedented rate, which makes early-stage prediction for social content popularity of great practical significance. However, most existing studies assume a long-term observation before prediction and suffer from limited precision for early-stage prediction due to insufficient observation. In this paper, we take a fresh perspective, and propose a novel early pattern aware Bayesian model. The early pattern representation, which stands for early time series normalized on future popularity, can address what we call early-stage indistinctiveness challenge. Then we use an expressive evolving function to fit the time series and estimate three interpretable coefficients characterizing temporal effect of observed series on future evolution. Furthermore, Bayesian network is leveraged to model the probabilistic relations among features, early indicators and early patterns. Experiments on three real-world social platforms (Twitter, Weibo and WeChat) show that under different evaluation metrics, our model outperforms other methods in early-stage prediction and possesses low sensitivity to observation time.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"48 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126051949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in computing resources have made it possible to collect enormous amounts of interconnected data, such as social media interactions, web activity, knowledge bases, product and service purchases, autonomous vehicle routing, smart home sensor data, and more. The massive scale and complexity of this data, however, not only vastly surpass human processing power but also exceed the limits of available computation and storage. That is, there is an urgent need for methods and tools that summarize large interconnected data to enable faster computation, storage reduction, interactive large-scale visualization and understanding, and pattern discovery. Network summarization, which aims to find a small representation of an original, larger graph, features a variety of methods with different goals and for different input data representations (e.g., attributed graphs, time-evolving or streaming graphs, heterogeneous graphs). The objective of this tutorial is to give a systematic overview of methods for summarizing and explaining graphs at different scales: the node-group level, the network level, and the multi-network level. We emphasize the current challenges, present real-world applications, and highlight the open research problems in this vibrant research area.
{"title":"Summarizing Graphs at Multiple Scales: New Trends","authors":"Danai Koutra, Jilles Vreeken, F. Bonchi","doi":"10.1109/ICDM.2018.00141","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00141","url":null,"abstract":"Recent advances in computing resources have made it possible to collect enormous amounts of interconnected data, such as social media interactions, web activity, knowledge bases, product and service purchases, autonomous vehicle routing, smart home sensor data, and more. The massive scale and complexity of this data, however, not only vastly surpasses human processing power, but also goes beyond limitations with regard to computation and storage. That is, there is an urgent need for methods and tools that summarize large interconnected data to enable faster computations, storage reduction, interactive large-scale visualization and understanding, and pattern discovery. Network summarization-which aims to find a small representation of an original, larger graph-features a variety of methods with different goals and for different input data representations (e.g., attributed graphs, time-evolving or streaming graphs, heterogeneous graphs). The objective of this tutorial is to give a systematic overview of methods for summarizing and explaining graphs at different scales: the node-group level, the network level, and the multi-network level. We emphasize the current challenges, present real-world applications, and highlight the open research problems in this vibrant research area.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"72 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123524154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Over the last couple of years, the Bitcoin cryptocurrency and the Blockchain technology that forms its basis have attracted unprecedented attention. Designed to facilitate a secure distributed platform without central regulation, Blockchain is heralded as a novel paradigm that will be as powerful as Big Data, Cloud Computing, and Machine Learning. Blockchain technology garners ever-increasing interest from researchers in various domains that benefit from scalable cooperation among trust-less parties. As Blockchain data analytics proliferates, gleaning successful approaches and disseminating them among a diverse body of data scientists has become a critical task. As an interdisciplinary team of researchers, our aim is to fill this vital role. In this tutorial, we offer a holistic view of Blockchain data analytics. Starting with the core components of Blockchain, we discuss the state of the art in Blockchain data analytics for the privacy, security, finance, and management domains. We share tutorial notes and further reading pointers on the tutorial website blockchaintutorial.github.io.
{"title":"Blockchain Data Analytics","authors":"C. Akcora, Murat Kantarcioglu, Y. Gel","doi":"10.1109/ICDM.2018.00013","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00013","url":null,"abstract":"Over the last couple of years, Bitcoin cryptocurrency and the Blockchain technology that forms the basis of Bitcoin have witnessed an unprecedented attention. Designed to facilitate a secure distributed platform without central regulation, Blockchain is heralded as a novel paradigm that will be as powerful as Big Data, Cloud Computing, and Machine Learning. The Blockchain technology garners an ever increasing interest of researchers in various domains that benefit from scalable cooperation among trust-less parties. As Blockchain data analytics further proliferates, a need to glean successful approaches and to disseminate them among a diverse body of data scientists became a critical task. As an inter-disciplinary team of researchers, our aim is to fill this vital role. In this tutorial, we offer a holistic view on Blockchain Data Analytics. Starting with the core components of Blockchain, we will discuss the state of art in Blockchain data analytics for privacy, security, finance, and management domains. We will share tutorial notes and further reading pointers on the tutorial website blockchaintutorial.github.io.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121732702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes a time-discounting convolution, a method for modeling event sequences with ambiguous timestamps. Unlike in ordinary time series, time intervals are not constant, small time-shifts have no significant effect, and feeding timestamps or time durations into a model is not effective. Our criteria for the modeling are robustness against time-shifts and timestamp uncertainty, as well as maintaining the essential capabilities of time-series models, i.e., forgetting meaningless past information and handling infinite sequences. The proposed method meets these criteria with a convolutional mechanism across time with specific parameterizations, which efficiently represents event dependencies in a time-shift invariant manner while discounting the effect of past events, and a dynamic pooling mechanism, which provides robustness against timestamp uncertainty and enhances the time-discounting capability by dynamically changing the pooling window size. In our learning algorithm, the decaying and dynamic pooling mechanisms play critical roles in handling infinite and variable-length sequences. Numerical experiments on real-world event sequences with ambiguous timestamps and on ordinary time series demonstrate the advantages of our method.
{"title":"Time-Discounting Convolution for Event Sequences with Ambiguous Timestamps","authors":"Takayuki Katsuki, T. Osogami, Akira Koseki, Masaki Ono, M. Kudo, M. Makino, Atsushi Suzuki","doi":"10.1109/ICDM.2018.00139","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00139","url":null,"abstract":"This paper proposes a method for modeling event sequences with ambiguous timestamps, a time-discounting convolution. Unlike in ordinary time series, time intervals are not constant, small time-shifts have no significant effect, and inputting timestamps or time durations into a model is not effective. The criteria that we require for the modeling are providing robustness against time-shifts or timestamps uncertainty as well as maintaining the essential capabilities of time-series models, i.e., forgetting meaningless past information and handling infinite sequences. The proposed method handles them with a convolutional mechanism across time with specific parameterizations, which efficiently represents the event dependencies in a time-shift invariant manner while discounting the effect of past events, and a dynamic pooling mechanism, which provides robustness against the uncertainty in timestamps and enhances the time-discounting capability by dynamically changing the pooling window size. In our learning algorithm, the decaying and dynamic pooling mechanisms play critical roles in handling infinite and variable length sequences. Numerical experiments on real-world event sequences with ambiguous timestamps and ordinary time series demonstrated the advantages of our method.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130730623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
User identity linkage (UIL), the problem of matching user accounts across multiple online social networks (OSNs), is widely studied and important to many real-world applications. Most existing UIL solutions adopt a supervised or semi-supervised approach and therefore generally suffer from the scarcity of labeled data. In this paper, we propose Factoid Embedding, a novel framework that adopts an unsupervised approach. It is designed to cope with the different profile attributes, content types, and network links of different OSNs. The key idea is that each piece of information about a user identity describes the real identity owner and thus distinguishes the owner from other users. We represent such a piece of information by a factoid and model it as a triplet consisting of a user identity, a predicate, and an object or another user identity. By embedding these factoids, we learn latent representations of user identities and link two user identities from different OSNs if they are close to each other in the user embedding space. Our Factoid Embedding algorithm is designed such that, as we learn the embedding space, each embedded factoid is "translated" into a motion in the user embedding space that brings similar user identities closer and pushes different user identities further apart. Extensive experiments are conducted to evaluate Factoid Embedding on two real-world OSN datasets. The results show that Factoid Embedding outperforms state-of-the-art methods even without training data.
{"title":"Unsupervised User Identity Linkage via Factoid Embedding","authors":"Wei Xie, Xin Mu, R. Lee, Feida Zhu, Ee-Peng Lim","doi":"10.1109/ICDM.2018.00182","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00182","url":null,"abstract":"User identity linkage (UIL), the problem of matching user account across multiple online social networks (OSNs), is widely studied and important to many real-world applications. Most existing UIL solutions adopt a supervised or semi-supervised approach which generally suffer from scarcity of labeled data. In this paper, we propose Factoid Embedding, a novel framework that adopts an unsupervised approach. It is designed to cope with different profile attributes, content types and network links of different OSNs. The key idea is that each piece of information about a user identity describes the real identity owner, and thus distinguishes the owner from other users. We represent such a piece of information by a factoid and model it as a triplet consisting of user identity, predicate, and an object or another user identity. By embedding these factoids, we learn the user identity latent representations and link two user identities from different OSNs if they are close to each other in the user embedding space. Our Factoid Embedding algorithm is designed such that as we learn the embedding space, each embedded factoid is \"translated\" into a motion in the user embedding space to bring similar user identities closer, and different user identities further apart. Extensive experiments are conducted to evaluate Factoid Embedding on two real-world OSNs data sets. The experiment results show that Factoid Embedding outperforms the state-of-the-art methods even without training data.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131069374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hawkes processes are widely used for modeling event cascades. However, content and cross-domain information, which are also instrumental in modeling, are usually neglected. In this paper, we propose a novel model called transfer Hybrid Least Square for Hawkes (trHLSH) that incorporates Hawkes processes with content and cross-domain information. We also present an effective learning algorithm for the model. Evaluation on both synthetic and real-world datasets demonstrates that the proposed model can jointly learn from temporal, content, and cross-domain information, and achieves better performance in terms of network recovery and prediction.
{"title":"Transfer Hawkes Processes with Content Information","authors":"Tianbo Li, Pengfei Wei, Yiping Ke","doi":"10.1109/ICDM.2018.00145","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00145","url":null,"abstract":"Hawkes processes are widely used for modeling event cascades. However, content and cross-domain information which is also instrumental in modeling is usually neglected. In this paper, we propose a novel model called transfer Hybrid Least Square for Hawkes (trHLSH) that incorporates Hawkes processes with content and cross-domain information. We also present the effective learning algorithm for the model. Evaluation on both synthetic and real-world datasets demonstrates that the proposed model can jointly learn knowledge from temporal, content and cross-domain information, and has better performance in terms of network recovery and prediction.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131621637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Identifying sets of items that are equivalent to one another is a problem common to many fields. Systems addressing it generally have at their core a function s(d_i, d_j) for computing the similarity between pairs of records d_i, d_j. The output of s() can be interpreted as a weighted graph whose edges indicate the likelihood of two records matching. Partitioning this graph into equivalence classes is non-trivial due to inconsistencies and imperfections in s(). Numerous algorithmic approaches to the problem have been proposed, but (1) it is unclear which approach should be used on a given dataset; (2) the algorithms generally do not output a confidence in their decisions; and (3) they require error-prone tuning to a particular notion of ground truth. We present SuperPart, a scalable, supervised learning approach to graph partitioning. We demonstrate that SuperPart yields competitive results on the problem of detecting equivalent records without manual selection of algorithms or an exhaustive search over hyperparameters. We also show the quality of SuperPart's confidence measures by reporting Area Under the Precision-Recall Curve metrics that exceed a baseline measure by 11%. Finally, to bolster further research in this domain, we release three new datasets derived from real-world Amazon product data, along with ground-truth partitionings.
{"title":"SuperPart: Supervised Graph Partitioning for Record Linkage","authors":"Russell Reas, Stephen M. Ash, Robert A. Barton, Andrew Borthwick","doi":"10.1109/ICDM.2018.00054","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00054","url":null,"abstract":"Identifying sets of items that are equivalent to one another is a problem common to many fields. Systems addressing this generally have at their core a function s(d_i, d_j) for computing the similarity between pairs of records d_i, d_j. The output of s() can be interpreted as a weighted graph where edges indicate the likelihood of two records matching. Partitioning this graph into equivalence classes is non-trivial due to the presence of inconsistencies and imperfections in s(). Numerous algorithmic approaches to the problem have been proposed, but (1) it is unclear which approach should be used on a given dataset; (2) the algorithms do not generally output a confidence in their decisions; and (3) require error-prone tuning to a particular notion of ground truth. We present SuperPart, a scalable, supervised learning approach to graph partitioning. We demonstrate that SuperPart yields competitive results on the problem of detecting equivalent records without manual selection of algorithms or an exhaustive search over hyperparameters. Also, we show the quality of SuperPart's confidence measures by reporting Area Under the Precision-Recall Curve metrics that exceed a baseline measure by 11%. Finally, to bolster additional research in this domain, we release three new datasets derived from real-world Amazon product data along with ground-truth partitionings.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127537470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Precise prediction of users' behavior is critical to users' satisfaction and platforms' benefit. A user's behavior heavily depends on the user's general preference and on contextual information (current location, weather, etc.). In this paper, we propose a succinct hierarchical framework named the Hierarchical Hybrid Feature Model (HHFM). It combines a user's general taste and diverse contextual information into a hybrid feature representation that profiles the user's dynamic preference with respect to context. Meanwhile, we propose an n-way concatenation pooling strategy to capture the non-linear and complex inherent structures of real-world data, which are ignored by most existing methods such as Factorization Machines. Conceptually, our model subsumes several existing methods when proper concatenation and pooling strategies are chosen. Extensive experiments show that our model consistently outperforms state-of-the-art methods on three real-world datasets.
{"title":"Hierarchical Hybrid Feature Model for Top-N Context-Aware Recommendation","authors":"Yingpeng Du, Hongzhi Liu, Zhonghai Wu, Xing Zhang","doi":"10.1109/ICDM.2018.00026","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00026","url":null,"abstract":"Precise prediction of users' behavior is critical for users' satisfaction and platforms' benefit. A user's behavior heavily depends on the user's general preference and contextual information (current location, weather etc.). In this paper, we propose a succinct hierarchical framework named Hierarchical Hybrid Feature Model (HHFM). It combines users' general taste and diverse contextual information into a hybrid feature representation to profile users' dynamic preference w.r.t context. Meanwhile, we propose an n-way concatenation pooling strategy to capture the non-linear and complex inherent structures of real-world data, which were ignored by most existing methods like Factorization Machines. Conceptually, our model subsumes several existing methods when choosing proper concatenation and pooling strategies. Extensive experiments show our model consistently outperforms state-of-the-art methods on three real-world data sets.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127006110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}