Active learning has been demonstrated to be a powerful tool for improving the effectiveness of binary classifiers. It iteratively identifies informative unlabeled examples which, after labeling, are used to augment the initial training set. Adapting the procedure to large-scale, multi-class classification problems, however, poses certain challenges. For instance, to guarantee that the method yields an improvement, we may need to select so many examples that labeling them becomes prohibitively expensive. Furthermore, the notion of an informative example changes significantly when multiple classes are considered. In this paper we show that multi-class active learning can be cast into an integer programming framework, in which a subset of examples that are informative across the maximum number of classes is selected. We test our approach on several large-scale document categorization problems. We demonstrate that, when labeling resources are limited and the number of classes is large, the proposed method is more effective than other known approaches.
{"title":"Integer Programming for Multi-class Active Learning","authors":"Dragomir Yankov, Suju Rajan, A. Ratnaparkhi","doi":"10.1109/ICDMW.2010.148","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.148","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
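The selection step described in this abstract can be illustrated with a very small 0/1 program. The sketch below is not the authors' formulation, which the abstract does not spell out. It assumes a hypothetical binary matrix `informative[i][c]` that marks example `i` as informative for class `c`, and it solves the problem by exhaustive search instead of calling an integer programming solver, which is only feasible for toy inputs.

```python
from itertools import combinations

def select_examples(informative, budget):
    """Pick `budget` examples that cover as many classes as possible.

    informative[i][c] is 1 if example i is informative for class c
    (an assumed input, e.g. derived from one-vs-rest classifier margins).
    Ties on coverage are broken by the total number of informative pairs.
    """
    n_examples = len(informative)
    n_classes = len(informative[0])
    best_subset, best_score = (), (-1, -1)
    # Exhaustive search stands in for an integer programming solver.
    for subset in combinations(range(n_examples), min(budget, n_examples)):
        covered = sum(
            1 for c in range(n_classes)
            if any(informative[i][c] for i in subset)
        )
        total = sum(sum(informative[i]) for i in subset)
        if (covered, total) > best_score:
            best_subset, best_score = subset, (covered, total)
    return best_subset

# Five unlabeled examples, four classes, labeling budget of two.
informative = [
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
]
print(select_examples(informative, budget=2))
```

With a budget of two, examples 0 and 3 together touch all four classes, which is the kind of cross-class coverage the abstract argues for when labels are scarce.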
Finding relevant documents in digital libraries is a well-studied problem in information retrieval. It is not uncommon to see users browsing digital collections without a clear idea of the keyword search they should perform. However, we believe that such an initial query is not totally independent of the target search. Therefore, we use the documents selected by the initial query as the starting point for further exploration. In this demonstration, we exploit On-line Analytical Processing (OLAP) for knowledge discovery in digital collections to achieve query refinement. The refinement is the result of applying a traditional ranking technique based on the vector space model, selecting the top keywords in the resulting subset of documents, and then displaying certain cuboids of those keywords. Based on these cuboids, which are ranked by frequency, users can select a query that better represents their actual target search. We show that this document exploration can be done efficiently within the DBMS and can exploit in-database extensions, such as User-Defined Functions, as well as standard SQL. Additionally, we demonstrate a novel approach to obtaining query refinement through OLAP data cubes.
{"title":"Enhancing Document Exploration with OLAP","authors":"Zhibo Chen, Carlos Garcia-Alvarado, C. Ordonez","doi":"10.1109/ICDMW.2010.37","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.37","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
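The keyword-cuboid step in this abstract can be pictured as counting how often each combination of top keywords co-occurs in the retrieved documents. The following sketch is only an illustration in plain Python. The demonstration itself runs inside the DBMS with SQL and User-Defined Functions, and the vector space ranking step is skipped here. The function name, the `top_k` and `max_dim` parameters, and the sample documents are all assumptions.

```python
from collections import Counter
from itertools import combinations

def keyword_cuboids(docs, top_k=3, max_dim=2):
    """Count documents containing each combination of the top keywords.

    docs is a list of token lists, assumed to be the documents already
    returned by the initial ranked query.
    """
    # Document frequency of every keyword (each keyword counted once per doc).
    doc_freq = Counter(tok for doc in docs for tok in dict.fromkeys(doc))
    top = {kw for kw, _ in doc_freq.most_common(top_k)}
    cuboids = Counter()
    for doc in docs:
        present = sorted(top.intersection(doc))
        # One cell per keyword combination, from 1 up to max_dim dimensions.
        for dim in range(1, max_dim + 1):
            for combo in combinations(present, dim):
                cuboids[combo] += 1
    return cuboids

docs = [
    ["olap", "cube", "query"],
    ["olap", "cube", "sql"],
    ["olap", "query", "cube"],
    ["olap", "udf"],
]
for combo, count in keyword_cuboids(docs).most_common():
    print(combo, count)
```

Each printed combination is a candidate refined query, ordered by how many of the retrieved documents contain it.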
This paper presents a probabilistic co-clustering approach to pattern discovery in preference data. We extend the original formulation of the block mixture model to handle rating data; the resulting model allows the simultaneous clustering of users and items into homogeneous user communities and item categories. The parameters of the model are determined using a variational approximation and a two-phase application of the EM algorithm. The experimental evaluation shows that the proposed approach can be used both for rating prediction and for pattern discovery tasks, such as the analysis of common trends within the same user community and the identification of interesting relationships between products belonging to the same item category. In particular, using MovieLens data, we show how it is possible to infer topics for each item category, and how to model community interests and transitions among topics of interest.
{"title":"A Block Mixture Model for Pattern Discovery in Preference Data","authors":"Nicola Barbieri, M. Guarascio, G. Manco","doi":"10.1109/ICDMW.2010.59","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.59","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
Privacy-preserving microdata publication has received wide attention. In this paper, we investigate the randomization approach and focus on attribute disclosure under linking attacks. We give efficient solutions for determining optimal distortion parameters, so that utility preservation is maximized while the privacy requirements are still satisfied. We compare our randomization approach with l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects: reconstructed distributions, accuracy of answering queries, and preservation of correlations. Our empirical results show that randomization incurs significantly smaller utility loss.
{"title":"On Attribute Disclosure in Randomization Based Privacy Preserving Data Publishing","authors":"Ling Guo, Xiaowei Ying, Xintao Wu","doi":"10.1109/ICDMW.2010.76","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.76","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
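To make the idea of distortion parameters and reconstructed distributions concrete, here is a minimal sketch of one common randomization scheme, uniform perturbation. It is an assumption for illustration and not necessarily the scheme or the optimization used in the paper. Each sensitive value is kept with probability `p` and otherwise replaced by one of the other `k - 1` values chosen uniformly, so the published distribution is a known linear function of the true one and can be inverted.

```python
def distort(true_dist, p):
    """Expected distribution after uniform perturbation.

    A value is retained with probability p; otherwise it is replaced by
    one of the other k - 1 values, each with probability q = (1 - p) / (k - 1).
    """
    k = len(true_dist)
    q = (1.0 - p) / (k - 1)
    # observed_j = p * t_j + q * (1 - t_j) = (p - q) * t_j + q
    return [(p - q) * t + q for t in true_dist]

def reconstruct(observed_dist, p):
    """Invert the perturbation to estimate the original distribution."""
    k = len(observed_dist)
    q = (1.0 - p) / (k - 1)
    if abs(p - q) < 1e-12:
        raise ValueError("p = 1/k destroys all information; cannot invert")
    return [(o - q) / (p - q) for o in observed_dist]

true_dist = [0.7, 0.2, 0.1]          # hypothetical sensitive-attribute distribution
observed = distort(true_dist, p=0.8)
print(observed)
print(reconstruct(observed, p=0.8))
```

A smaller `p` hides individual values better but, on a finite sample, makes the reconstructed distribution noisier. That is the privacy and utility trade-off that choosing the distortion parameters has to balance.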
F. Afendi, L. K. Darusman, Aki Hirai, M. Altaf-Ul-Amin, Hiroki Takahashi, Kensuke Nakamura, S. Kanaya
Jamu is an Indonesian herbal medicine made from a mixture of several plants. Some plants serve as main ingredients and the others as supporting ingredients. Using a biplot configuration, we explored the relationship between Indonesian herbal plants and the efficacy of jamu. Among the 465 plants used in 3138 jamu, we determined that 190 plants were effective for at least one efficacy. We therefore consider these plants to be the main ingredients of jamu. The other 275 plants are considered to be supporting ingredients because their efficacy has not been established.
{"title":"System Biology Approach for Elucidating the Relationship Between Indonesian Herbal Plants and the Efficacy of Jamu","authors":"F. Afendi, L. K. Darusman, Aki Hirai, M. Altaf-Ul-Amin, Hiroki Takahashi, Kensuke Nakamura, S. Kanaya","doi":"10.1109/ICDMW.2010.105","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.105","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
Online social networks are so popular nowadays that they are a major component of an individual’s social interaction. They are also emotionally rich environments where close friends share their emotions, feelings and thoughts. In this paper, a new framework is proposed for characterizing emotional interactions in social networks and then using these characteristics to distinguish friends from acquaintances. The goal is to extract the emotional content of texts in online social networks; the question of interest is whether a text is an expression of the writer’s emotions or not. For this purpose, text mining techniques are applied to comments retrieved from a social network. The framework includes a model for data collection, database schemas, data processing and data mining steps. The informal language of online social networks must be taken into account before any text mining technique is applied, which is why the framework includes the development of special lexicons. In general, the paper presents a new perspective for studying friendship relations and the expression of emotions in online social networks, one that takes into account the nature of these sites and of the language used. It considers Lebanese Facebook users as a case study. The technique adopted is unsupervised; it mainly uses the k-means clustering algorithm. Experiments show high accuracy for the model in both determining the subjectivity of texts and predicting friendship.
{"title":"A Framework for Emotion Mining from Text in Online Social Networks","authors":"Mohamed Yassine, Hazem M. Hajj","doi":"10.1109/ICDMW.2010.75","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.75","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
Sarika Mittal, Jothi Swarubini Vindhiya Varman, Gloria Chatzopoulou, M. Eirinaki, N. Polyzotis
This demonstration presents QueRIE, a recommender system that supports interactive database exploration. This system aims at assisting non-expert users of scientific databases by generating personalized query recommendations. Drawing inspiration from Web recommender systems, QueRIE tracks the querying behavior of each user and identifies potentially “interesting” parts of the database related to the corresponding data analysis task by locating those database parts that were accessed by similar users in the past. It then generates and recommends the queries that cover those parts to the user.
{"title":"QueRIE: A Query Recommender System Supporting Interactive Database Exploration","authors":"Sarika Mittal, Jothi Swarubini Vindhiya Varman, Gloria Chatzopoulou, M. Eirinaki, N. Polyzotis","doi":"10.1109/ICDMW.2010.43","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.43","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
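The recommendation idea in this abstract, locating database parts accessed by similar users, can be sketched as user-based collaborative filtering. The code below is a rough illustration under assumptions, not QueRIE's implementation. Sessions are modeled as dictionaries mapping hypothetical fragment names to access weights, similarity is plain cosine similarity, and the final step of turning the recommended fragments into queries is left out.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse fragment-weight vectors."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend_fragments(current, past_sessions, top_n=2):
    """Score fragments the current user has not touched yet.

    Each past session votes for its fragments, weighted by how similar
    that session is to the current one.
    """
    scores = {}
    for session in past_sessions:
        sim = cosine(current, session)
        for fragment, weight in session.items():
            if fragment not in current:
                scores[fragment] = scores.get(fragment, 0.0) + sim * weight
    # Highest score first; fragment name breaks ties deterministically.
    return sorted(scores, key=lambda f: (-scores[f], f))[:top_n]

# Hypothetical fragment names for a scientific (astronomy) database.
current = {"stars.magnitude": 1, "stars.ra": 1}
past_sessions = [
    {"stars.magnitude": 1, "stars.ra": 1, "stars.dec": 1},
    {"galaxies.redshift": 1, "stars.ra": 1},
    {"galaxies.redshift": 1, "galaxies.type": 1},
]
print(recommend_fragments(current, past_sessions))
```

A system in this style would then generate or retrieve queries that cover the top-scoring fragments and present them to the user.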
A. Bagherjeiran, A. O. Hatch, A. Ratnaparkhi, R. Parekh
Performance advertisers want to maximize the return on their advertising spend. In the online advertising world, this means showing the ad only to those users most likely to convert, i.e., to buy a product or service. Existing ad targeting solutions, such as context targeting and rule-based segment targeting, primarily leverage marketing intuition to identify audience segments that are likely to convert. Even the more sophisticated model-based approaches, such as behavioral targeting, identify audience segments interested in certain coarse-grained categories defined by the publisher. Advertisers are now able, through beaconing, to tell us exactly who their preferred customers are. Advertisers want to augment their existing advertising campaigns with custom models that learn from the campaign and focus on attracting new users. Motivated by our experience with advertisers, we pose this problem within the context of ensemble learning. Building custom models for an existing ad campaign can be viewed as operations on an ensemble classifier: add, modify, or complement a classifier. An ideal new classifier should incrementally improve the ensemble and minimize overlap with the classifiers already in the ensemble; that is, it should learn something new. With the proposed approach we are able to augment the advertising campaigns of several large advertisers at a large online advertising company.
{"title":"Large-Scale Customized Models for Advertisers","authors":"A. Bagherjeiran, A. O. Hatch, A. Ratnaparkhi, R. Parekh","doi":"10.1109/ICDMW.2010.157","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.157","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
Dwi A. P. Rahayu, S. Krishnaswamy, O. Alahakoon, C. Labbé
Review mining is a part of web mining that focuses on extracting the main information from user reviews. State-of-the-art review mining systems focus on identifying the semantic orientation of reviews and providing sentence or feature scores. There has been little focus on understanding the rationale for the ratings that are provided. This paper presents our proposed RnR system for extracting rationale from online reviews and ratings. We have implemented the system and evaluated it on online hotel reviews from TripAdvisor.com. We present an extensive experimental evaluation that demonstrates the improved computational performance of our approach and its accuracy in identifying the rationale. The RnR system is available for testing at http://rnrsystem.com/RnRSystem
{"title":"RnR: Extracting Rationale from Online Reviews and Ratings","authors":"Dwi A. P. Rahayu, S. Krishnaswamy, O. Alahakoon, C. Labbé","doi":"10.1109/ICDMW.2010.167","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.167","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous, unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events that may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model, providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. In this paper, we outline the S4 architecture in detail and describe various applications, including real-life deployments. Our design is primarily driven by large-scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to running on large clusters built with commodity hardware.
{"title":"S4: Distributed Stream Computing Platform","authors":"L. Neumeyer, B. Robbins, Anish Nair, Anand Kesari","doi":"10.1109/ICDMW.2010.172","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.172","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
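The routing of keyed events to Processing Elements described in this abstract can be mimicked in a few lines. The sketch below runs in a single process and does not use S4's actual API. The class names, the `(stream, key, value)` event tuple, and the `run` dispatcher are all hypothetical, chosen only to show one PE instance being bound to each key and PEs emitting events for other PEs to consume.

```python
from collections import deque

class SplitterPE:
    """Consumes a line of text and emits one keyed event per word."""
    def __init__(self, key):
        self.key = key

    def process(self, value):
        return [("words", word, 1) for word in value.split()]

class CounterPE:
    """Keeps a running count for the single key it is bound to."""
    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, value):
        self.count += value
        return []  # emits nothing downstream; its state is the result

def run(events, pe_types):
    """Route each (stream, key, value) event to the PE instance for its key."""
    instances = {}
    queue = deque(events)
    while queue:
        stream, key, value = queue.popleft()
        if (stream, key) not in instances:
            # PEs are created lazily, one per distinct key on a stream.
            instances[(stream, key)] = pe_types[stream](key)
        queue.extend(instances[(stream, key)].process(value))
    return instances

instances = run(
    [("lines", None, "a b a"), ("lines", None, "b a")],
    {"lines": SplitterPE, "words": CounterPE},
)
counts = {key: pe.count for (stream, key), pe in instances.items() if stream == "words"}
print(counts)
```

Because every event for a given key reaches the same PE instance, each counter can keep its state locally. In a distributed platform the same property would let instances for different keys be placed on different nodes.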