In the past 30 years, tremendous progress has been achieved in building effective shallow classification models. Despite this success, we have come to realize that, for many applications, the key bottleneck is not the quality of classifiers but that of features. The inability to automatically obtain useful features has become the main limitation of shallow models. Since 2006, learning high-level features from raw data using deep architectures has driven a huge wave of new learning paradigms. In the past two years, deep learning has produced many performance breakthroughs, for example in image understanding and speech recognition. In this talk, I will walk through some of the latest technical advances in deep learning at Baidu, and discuss the main challenges, e.g., developing effective models for various applications and scaling up model training across many GPUs. At the end of the talk, I will discuss what might be interesting directions for future work.
{"title":"Large-scale deep learning at Baidu","authors":"Kai Yu","doi":"10.1145/2505515.2514699","DOIUrl":"https://doi.org/10.1145/2505515.2514699","url":null,"abstract":"In the past 30 years, tremendous progress has been achieved in building effective shallow classification models. Despite the success, we come to realize that, for many applications, the key bottleneck is not the qualify of classifiers but that of features. Not being able to automatically get useful features has become the main limitation for shallow models. Since 2006, learning high-level features using deep architectures from raw data has become a huge wave of new learning paradigms. In recent two years, deep learning has made many performance breakthroughs, for example, in the areas of image understanding and speech recognition. In this talk, I will walk through some of the latest technology advances of deep learning within Baidu, and discuss the main challenges, e.g., developing effective models for various applications, and scaling up the model training using many GPUs. In the end of the talk I will discuss what might be interesting future directions.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89621263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Baraglia, Cristina Ioana Muntean, F. M. Nardini, F. Silvestri
In this paper, we tackle the problem of predicting the "next" geographical position of a tourist given her history (i.e., the prediction is made according to the tourist's current trail) by means of supervised learning techniques, namely Gradient Boosted Regression Trees and Ranking SVM. Learning is performed over an object space represented by a 68-dimensional feature vector specifically designed for tourism-related data. Furthermore, we present a thorough comparison against several methods considered state-of-the-art in tourist recommendation and trail prediction, as well as against a strong popularity baseline. Experiments show that our methods outperform these competitors and baselines, providing strong evidence of the effectiveness of our solutions.
{"title":"LearNext: learning to predict tourists movements","authors":"R. Baraglia, Cristina Ioana Muntean, F. M. Nardini, F. Silvestri","doi":"10.1145/2505515.2505656","DOIUrl":"https://doi.org/10.1145/2505515.2505656","url":null,"abstract":"In this paper, we tackle the problem of predicting the \"next\" geographical position of a tourist given her history (i.e., the prediction is done accordingly to the tourist's current trail) by means of supervised learning techniques, namely Gradient Boosted Regression Trees and Ranking SVM. The learning is done on the basis of an object space represented by a 68 dimension feature vector, specifically designed for tourism related data. Furthermore, we propose a thorough comparison of several methods that are considered state-of-the-art in touristic recommender and trail prediction systems as well as a strong popularity baseline. Experiments show that the methods we propose outperform important competitors and baselines thus providing strong evidence of the performance of our solutions.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89758505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The problem of discovering information-flow trends and influencers in social networks has become increasingly relevant, both because of the increasing amount of content available from online networks in the form of social streams, and because of its value as a tool for analyzing content trends. An important part of this analysis is determining the key patterns of flow and the corresponding influencers in the underlying network. Almost all prior work on influence analysis has focused on fixed models of the network structure and on edge-based transmission between nodes. In this paper, we propose a fully content-centric model of flow analysis in social-network streams, in which the analysis is based on actual content transmissions in the network rather than on a static model of transmission along the edges. We first introduce the problem of information-flow mining in social streams, and then propose a novel algorithm, InFlowMine, to discover information-flow patterns in the network. We then leverage this approach to determine the key influencers in the network. Our approach is flexible, since it can also determine topic-specific influencers. We experimentally show the effectiveness and efficiency of our model.
{"title":"Content-centric flow mining for influence analysis in social streams","authors":"Karthik Subbian, C. Aggarwal, J. Srivastava","doi":"10.1145/2505515.2505626","DOIUrl":"https://doi.org/10.1145/2505515.2505626","url":null,"abstract":"The problem of discovering information flow trends and influencers in social networks has become increasingly relevant both because of the increasing amount of content available from online networks in the form of social streams, and because of its relevance as a tool for content trends analysis. An important part of this analysis is to determine the key patterns of flow and corresponding influencers in the underlying network. Almost all the work on influence analysis has focused on fixed models of the network structure, and edge-based transmission between nodes. In this paper, we propose a fully content-centered model of flow analysis in social network streams, in which the analysis is based on actual content transmissions in the network, rather than a static model of transmission on the edges. First, we introduce the problem of information flow mining in social streams, and then propose a novel algorithm InFlowMine to discover the information flow patterns in the network. We then leverage this approach to determine the key influencers in the network. Our approach is flexible, since it can also determine topic-specific influencers. We experimentally show the effectiveness and efficiency of our model.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"86 5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86499554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Dirichlet process mixture (DPM) model is one of the most important Bayesian nonparametric models, owing to its efficient inference and its flexibility across applications. A fundamental assumption made by the DPM model is that all data items are generated from a single, shared DP. This assumption, however, is restrictive in many practical settings where samples are generated from a collection of dependent DPs, each associated with a point in some covariate space. For example, documents in conference proceedings are organized by year, and photos may be tagged and recorded with GPS locations. We present a general method for constructing dependent Dirichlet processes (DPs) on an arbitrary covariate space. The approach is based on restricting and projecting a DP defined on a space of continuous functions with different domains, which yields a collection of dependent random measures, each associated with a point in the covariate space and each marginally DP-distributed. The constructed collection of dependent DPs can be used as a nonparametric prior for infinite dynamic mixture models, which allow each mixture component to appear or disappear and to vary within a subspace of the covariate space. Furthermore, we discuss choices of base distributions over functions in a variety of settings as a flexible means of controlling dependencies. In addition, we develop an efficient Gibbs sampler for model inference in which all underlying random measures are integrated out. Finally, experimental results on temporal and spatial modeling datasets demonstrate the effectiveness of the method in modeling dynamic mixtures over different types of covariates.
{"title":"Functional dirichlet process","authors":"Lijing Qin, Xiaoyan Zhu","doi":"10.1145/2505515.2505537","DOIUrl":"https://doi.org/10.1145/2505515.2505537","url":null,"abstract":"Dirichlet process mixture (DPM) model is one of the most important Bayesian nonparametric models owing to its efficiency of inference and flexibility for various applications. A fundamental assumption made by DPM model is that all data items are generated from a single, shared DP. This assumption, however, is restrictive in many practical settings where samples are generated from a collection of dependent DPs, each associated with a point in some covariate space. For example, documents in the proceedings of a conference are organized by year, or photos may be tagged and recorded with GPS locations. We present a general method for constructing dependent Dirichlet processes (DP) on arbitrary covariate space. The approach is based on restricting and projecting a DP defined on a space of continuous functions with different domains, which results in a collection of dependent random measures, each associated with a point in covariate space and is marginally DP distributed. The constructed collection of dependent DPs can be used as a nonparametric prior of infinite dynamic mixture models, which allow each mixture component to appear/disappear and vary in a subspace of covariate space. Furthermore, we discuss choices of base distributions of functions in a variety of settings as a flexible method to control dependencies. In addition, we develop an efficient Gibbs sampler for model inference where all underlying random measures are integrated out. Finally, experiment results on temporal modeling and spatial modeling datasets demonstrate the effectiveness of the method in modeling dynamic mixture models on different types of covariates.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85817579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frequent sequential pattern mining is a central task in many fields, such as biology and finance. However, the release of these patterns raises increasing concerns about individual privacy. In this paper, we study the sequential pattern mining problem under the differential privacy framework, which provides formal and provable privacy guarantees. Because the differential privacy mechanism perturbs the frequency results with noise, and because the pattern space is high-dimensional, this mining problem is particularly challenging. In this work, we propose a novel two-phase algorithm for mining both prefix and substring patterns. In the first phase, our approach exploits the statistical properties of the data to construct a model-based prefix tree, which is used to mine prefixes and a candidate set of substring patterns. The frequencies of the substring patterns are then refined in the second phase, where we employ a novel transformation of the original data to reduce the perturbation noise. Extensive experimental results on real datasets show that our approach is effective for mining both substring and prefix patterns compared with state-of-the-art solutions.
{"title":"A two-phase algorithm for mining sequential patterns with differential privacy","authors":"Luca Bonomi, Li Xiong","doi":"10.1145/2505515.2505553","DOIUrl":"https://doi.org/10.1145/2505515.2505553","url":null,"abstract":"Frequent sequential pattern mining is a central task in many fields such as biology and finance. However, release of these patterns is raising increasing concerns on individual privacy. In this paper, we study the sequential pattern mining problem under the differential privacy framework which provides formal and provable guarantees of privacy. Due to the nature of the differential privacy mechanism which perturbs the frequency results with noise, and the high dimensionality of the pattern space, this mining problem is particularly challenging. In this work, we propose a novel two-phase algorithm for mining both prefixes and substring patterns. In the first phase, our approach takes advantage of the statistical properties of the data to construct a model-based prefix tree which is used to mine prefixes and a candidate set of substring patterns. The frequency of the substring patterns is further refined in the successive phase where we employ a novel transformation of the original data to reduce the perturbation noise. Extensive experiment results using real datasets showed that our approach is effective for mining both substring and prefix patterns in comparison to the state-of-the-art solutions.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78901264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traditional recrawling methods learn navigation patterns in order to crawl related web pages. However, they cannot remove the redundancy found on the web, especially at the object level. To deal with this problem, we propose a new hypertext resource discovery method, called "selective recrawling", for object-level vertical search applications. The goal of selective recrawling is to automatically generate URL patterns and select those pages that have the widest coverage and the least irrelevance and redundancy relative to a pre-defined vertical domain. The method requires only a few seed objects and selects the set of URL patterns that covers the greatest number of objects. The selected set can be used for some time to recrawl web pages and can be renewed periodically, leading to significant savings in hardware and network resources. In this paper, we present a detailed framework of selective recrawling for object-level vertical search. The method automatically extends the set of candidate websites from the initial seed objects. Based on the objects extracted from these websites, it learns a set of URL patterns that covers the greatest number of target objects with little redundancy. Finally, the navigation patterns generated from the selected URL pattern set are used to guide future crawling. Experiments on local event data show that our method greatly reduces the number of downloaded web pages while maintaining comparable object coverage.
{"title":"A pattern-based selective recrawling approach for object-level vertical search","authors":"Yaqian Zhou, Qi Zhang, Xuanjing Huang, Lide Wu","doi":"10.1145/2505515.2505707","DOIUrl":"https://doi.org/10.1145/2505515.2505707","url":null,"abstract":"Traditional recrawling methods learn navigation patterns in order to crawl related web pages. However, they cannot remove the redundancy found on the web, especially at the object level. To deal with this problem, we propose a new hypertext resource discovery method, called ``selective recrawling'' for object-level vertical search applications. The goal of selective recrawling is to automatically generate URL patterns, then select those pages that have the widest coverage, and least irrelevance and redundancy relative to a pre-defined vertical domain. This method only requires a few seed objects and can select the set of URL patterns that covers the greatest number of objects. The selected set can continue to be used for some time to recrawl web pages and can be renewed periodically. This leads to significant savings in hardware and network resources. In this paper we present a detailed framework of selective recrawling for object-level vertical search. The selective recrawling method automatically extends the set of candidate websites from initial seed objects. Based on the objects extracted from these websites it learns a set of URL patterns which covers the greatest number of target objects with little redundancy. Finally, the navigation patterns generated from the selected URL pattern set are used to guide future crawling. Experiments on local event data show that our method can greatly reduce downloading of web pages while maintaining comparative object coverage.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79051073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zi Yang, E. Garduño, Yan Fang, Avner Maiberg, Collin McCormack, Eric Nyberg
Software frameworks that support the integration and scaling of text analysis algorithms make it possible to build complex, high-performance information systems for information extraction, information retrieval, and question answering; IBM's Watson is a prominent example. As the complexity and scale of information systems grow, it becomes much more challenging to determine, effectively and efficiently, which toolkits, algorithms, knowledge bases, or other resources should be integrated into a system in order to achieve a desired or optimal level of performance on a given task. This paper presents a formal representation of the space of possible system configurations, given a set of information processing components and their parameters (the configuration space), and discusses algorithmic approaches to determining the optimal configuration within a given configuration space (configuration space exploration, or CSE). We introduce the CSE framework, an extension of the UIMA framework that provides a general distributed solution for building and exploring configuration spaces for information systems. The CSE framework was used to implement biomedical information systems in case studies involving over a trillion different combinations of components and parameter values, operating on question answering tasks from the TREC Genomics track. The framework automatically and efficiently evaluated different system configurations and identified configurations that achieved better results than previously published ones.
{"title":"Building optimal information systems automatically: configuration space exploration for biomedical information systems","authors":"Zi Yang, E. Garduño, Yan Fang, Avner Maiberg, Collin McCormack, Eric Nyberg","doi":"10.1145/2505515.2505692","DOIUrl":"https://doi.org/10.1145/2505515.2505692","url":null,"abstract":"Software frameworks which support integration and scaling of text analysis algorithms make it possible to build complex, high performance information systems for information extraction, information retrieval, and question answering; IBM's Watson is a prominent example. As the complexity and scaling of information systems become ever greater, it is much more challenging to effectively and efficiently determine which toolkits, algorithms, knowledge bases or other resources should be integrated into an information system in order to achieve a desired or optimal level of performance on a given task. This paper presents a formal representation of the space of possible system configurations, given a set of information processing components and their parameters (configuration space) and discusses algorithmic approaches to determine the optimal configuration within a given configuration space (configuration space exploration or CSE). We introduce the CSE framework, an extension to the UIMA framework which provides a general distributed solution for building and exploring configuration spaces for information systems. The CSE framework was used to implement biomedical information systems in case studies involving over a trillion different configuration combinations of components and parameter values operating on question answering tasks from the TREC Genomics. The framework automatically and efficiently evaluated different system configurations, and identified configurations that achieved better results than prior published results.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77282475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Paul N. Bennett, C. Lee Giles, A. Halevy, Jiawei Han, Marti A. Hearst, J. Leskovec
With massive amounts of data being generated and stored ubiquitously in every discipline and every aspect of our daily life, handling such big data poses many challenging issues for researchers in data and information systems. The participants of CIKM 2013 are active researchers in large-scale data, information, and knowledge management, drawn from multiple disciplines, including database systems, data mining, information retrieval, human-computer interaction, and knowledge and information management. As a group of experienced researchers from academia and industry, the panelists will present their visions of the challenging research issues in this promising research frontier, and we hope to spark lively discussion and debate with the audience. We expect panelists with diverse backgrounds to raise different challenging research problems and to exchange views with each other and with the audience. A lively discussion may help young researchers understand the needs of research in both industry and academia, invest their efforts in the most important research issues, and make an impact on the development of new principles, methodologies, and technologies.
{"title":"Channeling the deluge: research challenges for big data and information systems","authors":"Paul N. Bennett, C. Lee Giles, A. Halevy, Jiawei Han, Marti A. Hearst, J. Leskovec","doi":"10.1145/2505515.2525541","DOIUrl":"https://doi.org/10.1145/2505515.2525541","url":null,"abstract":"With massive amounts of data being generated and stored ubiquitously in every discipline and every aspect of our daily life, how to handle such big data poses many challenging issues to researchers in data and information systems. The participants of CIKM 2013 are active researchers on large scale data, information and knowledge management, from multiple disciplines, including database systems, data mining, information retrieval, human-computer interaction, and knowledge or information management. As a group of experienced researchers in academia and industry, we will present at this panel our visions on what should be the challenging research issues in this promising research frontier and hope to attract heated discussions and debates from the audience. We expect panelists with diverse backgrounds raise different challenging research problems and exchange their views with each other and with the audience. A heated discussion may help young researchers understand the need for research in both industry and academia and invest their efforts on more important research issues and make impacts to the development of new principles, methodologies, and technologies.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"13 1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79883982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thanh-Son Nguyen, Hady W. Lauw, Panayiotis Tsaparas
Online reviews are an invaluable resource for web users trying to make decisions about products or services. However, the abundance of review content, as well as the unstructured, lengthy, and verbose nature of reviews, makes it hard for users to locate the appropriate reviews and distill the useful information. With the recent growth of social networking and micro-blogging services, we observe the emergence of a new type of online review content, consisting of bite-sized, 140-character reviews often posted reactively, on the spot, via mobile devices. These micro-reviews are short, concise, and focused, nicely complementing the lengthy, elaborate, and verbose nature of full-text reviews. We propose a novel methodology that brings these two diverse types of review content together to obtain something that is more than the sum of its parts. We use micro-reviews as a crowdsourced means of extracting the salient aspects of the reviewed item, and propose a new formulation of the review selection problem that aims to find a small set of reviews that efficiently covers the micro-reviews. Our approach consists of a two-step process: matching review sentences to micro-reviews, and then selecting reviews such that we cover as many micro-reviews as possible with few sentences. We perform a detailed evaluation of all the steps of our methodology using data collected from Foursquare and Yelp.
{"title":"Using micro-reviews to select an efficient set of reviews","authors":"Thanh-Son Nguyen, Hady W. Lauw, Panayiotis Tsaparas","doi":"10.1145/2505515.2505568","DOIUrl":"https://doi.org/10.1145/2505515.2505568","url":null,"abstract":"Online reviews are an invaluable resource for web users trying to make decisions regarding products or services. However, the abundance of review content, as well as the unstructured, lengthy, and verbose nature of reviews make it hard for users to locate the appropriate reviews, and distill the useful information. With the recent growth of social networking and micro-blogging services, we observe the emergence of a new type of online review content, consisting of bite-sized, 140 character-long reviews often posted reactively on the spot via mobile devices. These micro-reviews are short, concise, and focused, nicely complementing the lengthy, elaborate, and verbose nature of full-text reviews. We propose a novel methodology that brings together these two diverse types of review content, to obtain something that is more than the sum of its parts. We use micro-reviews as a crowdsourced way to extract the salient aspects of the reviewed item, and propose a new formulation of the review selection problem that aims to find a small set of reviews that efficiently cover the micro-reviews. Our approach consists of a two-step process: matching review sentences to micro-reviews and then selecting reviews such that we cover as many micro-reviews as possible, with few sentences. We perform a detailed evaluation of all the steps of our methodology using data collected from Foursquare and Yelp.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86271808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper exploits users' browsing intents to predict the category of a user's next access during web surfing, and applies the results to objectionable content filtering. A user's access trail, represented as a sequence of URLs, reveals contextual information about web browsing behavior. We extract behavioral features of each clicked URL, i.e., hostname, bag-of-words, gTLD, IP, and port, to develop a linear-chain CRF model for context-aware category prediction. Large-scale experiments show that our method achieves a promising accuracy of 0.9396 for identifying objectionable accesses without requesting the corresponding page content. Error analysis indicates that our model yields a low false positive rate of 0.0571. In real-life filtering simulations, our model achieves a macro-averaged blocking rate of 0.9271 while maintaining a favorably low macro-averaged over-blocking rate of 0.0575 for collaboratively filtering objectionable content as the dynamic web changes over time.
{"title":"Objectionable content filtering by click-through data","authors":"Lung-Hao Lee, Yen-Cheng Juan, Hsin-Hsi Chen, Yuen-Hsien Tseng","doi":"10.1145/2505515.2507849","DOIUrl":"https://doi.org/10.1145/2505515.2507849","url":null,"abstract":"This paper explores users' browsing intents to predict the category of a user's next access during web surfing, and applies the results to objectionable content filtering. A user's access trail represented as a sequence of URLs reveals the contextual information of web browsing behaviors. We extract behavioral features of each clicked URL, i.e., hostname, bag-of-words, gTLD, IP, and port, to develop a linear chain CRF model for context-aware category prediction. Large-scale experiments show that our method achieves a promising accuracy of 0.9396 for objectionable access identification without requesting their corresponding page content. Error analysis indicates that our proposed model results in a low false positive rate of 0.0571. In real-life filtering simulations, our proposed model accomplishes macro-averaging blocking rate 0.9271, while maintaining a favorably low macro-averaging over-blocking rate 0.0575 for collaboratively filtering objectionable content with time change on the dynamic web.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"62 7","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91553620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}