L. Rocha, Fernando Mourão, Ramon Vieira, A. Neves, D. Carvalho, Bortik Bandyopadhyay, S. Parthasarathy, R. Ferreira
Social media applications have assumed an important role in users' decision-making processes, affecting their choices about products and services. In this context, understanding and modeling opinions, as well as opinion-leaders, has implications for several tasks, such as recommendation, advertising, and brand evaluation. Despite the intrinsic relation between opinions and opinion-leaders, most recent works focus exclusively on either understanding opinions, through Sentiment Analysis (SA), or identifying opinion-leaders, through Influential Users Detection (IUD). This paper presents a preliminary evaluation of a combined analysis of SA and IUD. To this end, we propose a methodology to quantify factors in real domains that may affect such an analysis, as well as the potential benefits of combining SA methods with IUD ones. Empirical assessments on a sample of tweets about the Brazilian president reveal that the collective opinion and the set of top opinion-leaders over time are inter-related. Further, we were able to identify distinct characteristics of opinion propagation, and to show that the collective opinion may be accurately estimated using only a few top-k opinion-leaders. These results point to the combined analysis of SA and IUD as a promising research direction to be further explored.
{"title":"Connecting Opinions to Opinion-Leaders: A Case Study on Brazilian Political Protests","authors":"L. Rocha, Fernando Mourão, Ramon Vieira, A. Neves, D. Carvalho, Bortik Bandyopadhyay, S. Parthasarathy, R. Ferreira","doi":"10.1109/DSAA.2016.77","DOIUrl":"https://doi.org/10.1109/DSAA.2016.77","url":null,"abstract":"Social media applications have assumed an important role in decision-making process of users, affecting their choices about products and services. In this context, understanding and modeling opinions, as well as opinion-leaders, have implications for several tasks, such as recommendation, advertising, brand evaluation etc. Despite the intrinsic relation between opinions and opinion-leaders, most recent works focus exclusively on either understanding the opinions, by Sentiment Analysis (SA) proposals, or identifying opinion-leaders using Influential Users Detection (IUD). This paper presents a preliminary evaluation about a combined analysis of SA and IUD. In this sense, we propose a methodology to quantify factors in real domains that may affect such analysis, as well as the potential benefits of combining SA Methods with IUD ones. Empirical assessments on a sample of tweets about the Brazilian president reveal that the collective opinion and the set of top opinion-leaders over time are inter-related. Further, we were able to identify distinct characteristics of opinion propagation, and that the collective opinion may be accurately estimated by using a few top-k opinion-leaders. These results point out the combined analysis of SA and IUD as a promising research direction to be further exploited.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124961425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aditya Parikh, M. Raval, Chandrasinh Parmar, S. Chaudhary
The primary focus of this paper is to detect disease and estimate its stage for a cotton plant using images. Most disease symptoms are reflected on the cotton leaf. Unlike earlier approaches, the novelty of the proposal lies in processing images captured under uncontrolled conditions in the field, using a regular or mobile phone camera operated by an untrained person. Such field images have a cluttered background, making leaf segmentation very challenging. The proposed work uses two cascaded classifiers. Using local statistical features, the first classifier segments the leaf from the background. Then, using hue and luminance from the HSV colour space, a second classifier is trained to detect the disease and determine its stage. The developed algorithm is generalised, as it can be applied to any disease. As a showcase, however, we detect Grey Mildew, a fungal disease widely prevalent in North Gujarat, India.
{"title":"Disease Detection and Severity Estimation in Cotton Plant from Unconstrained Images","authors":"Aditya Parikh, M. Raval, Chandrasinh Parmar, S. Chaudhary","doi":"10.1109/DSAA.2016.81","DOIUrl":"https://doi.org/10.1109/DSAA.2016.81","url":null,"abstract":"The primary focus of this paper is to detect disease and estimate its stage for a cotton plant using images. Most disease symptoms are reflected on the cotton leaf. Unlike earlier approaches, the novelty of the proposal lies in processing images captured under uncontrolled conditions in the field using normal or a mobile phone camera by an untrained person. Such field images have a cluttered background making leaf segmentation very challenging. The proposed work use two cascaded classifiers. Using local statistical features, first classifier segments leaf from the background. Then using hue and luminance from HSV colour space another classifier is trained to detect disease and find its stage. The developed algorithm is a generalised as it can be applied for any disease. However as a showcase, we detect Grey Mildew, widely prevalent fungal disease in North Gujarat, India.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122038660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extracting structured information from unstructured text is a critical problem. Over the past few years, various clustering algorithms have been proposed to solve this problem. In addition, various algorithms based on probabilistic topic models have been developed to find the hidden thematic structure in various corpora (e.g., publications, blogs). Both types of algorithms have been transferred to the domain of scientific literature to extract structured information for problems such as data exploration and expert detection. In order to remain domain-agnostic, these algorithms do not exploit the structure present in a scientific publication. The majority of researchers interpret a scientific publication as research conducted to report progress in solving some research problem. Following this interpretation, in this paper we present a different outlook on the same problem by modelling scientific publications around research problems. By associating a scientific publication with a research problem, exploring the scientific literature becomes more intuitive. We propose an unsupervised framework to mine research problems from the titles and abstracts of scientific literature. Our framework uses weighted frequent phrase mining to generate phrases and filters them to obtain high-quality phrases. These high-quality phrases are then used to segment each scientific publication into meaningful semantic units. After segmenting publications, we apply a number of heuristics to score the phrases and sentences in order to identify the research problems. In a postprocessing step, we use a neighborhood-based algorithm to merge different representations of the same problems. Experiments conducted on parts of the DBLP dataset show promising results.
{"title":"Mining Research Problems from Scientific Literature","authors":"Chanakya Aalla, Vikram Pudi","doi":"10.1109/DSAA.2016.44","DOIUrl":"https://doi.org/10.1109/DSAA.2016.44","url":null,"abstract":"Extracting structured information from unstructured text is a critical problem. Over the past few years, various clustering algorithms have been proposed to solve this problem. In addition, various algorithms based on probabilistic topic models have been developed to find the hidden thematic structure from various corpora (i.e publications, blogs etc). Both types of algorithms have been transferred to the domain of scientific literature to extract structured information to solve problems like data exploration, expert detection etc. In order to remain domain-agnostic, these algorithms do not exploit the structure present in a scientific publication. Majority of researchers interpret a scientific publication as research conducted to report progress in solving some research problems. Following this interpretation, in this paper we present a different outlook to the same problem by modelling scientific publications around research problems. By associating a scientific publication with a research problem, exploring the scientific literature becomes more intuitive. In this paper, we propose an unsupervised framework to mine research problems from titles and abstracts of scientific literature. Our framework uses weighted frequent phrase mining to generate phrases and filters them to obtain high-quality phrases. These high-quality phrases are then used to segment the scientific publication into meaningful semantic units. After segmenting publications, we apply a number of heuristics to score the phrases and sentences to identify the research problems. In a postprocessing step we use a neighborhood based algorithm to merge different representations of the same problems. Experiments conducted on parts of DBLP dataset show promising results.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131662117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Buying or selling a house is one of the most important decisions in a person's life. Online listing websites such as "zillow.com", "trulia.com", and "realtor.com" provide significant and effective assistance during the buy/sell process. However, they fail to supply one important piece of information about a house: approximately how long will it take for the house to be sold after it first appears in the listing? This information is equally important for both a potential buyer and the seller. With it, the seller has an understanding of what she can do to expedite the sale, e.g., reduce the asking price or renovate/remodel some home features. On the other hand, a potential buyer has an idea of the time available for her to react, i.e., to place an offer. In this work, we propose a supervised regression (Cox regression) model inspired by survival analysis to predict the sale probability of a house given historical home sale information within an observation time window. We use real-life housing data collected from "trulia.com" to validate the proposed prediction algorithm and show its superior performance over traditional regression methods. We also show how the sale probability of a house is influenced by the values of basic house features, such as price, size, number of bedrooms, number of bathrooms, and school quality.
{"title":"Waiting to Be Sold: Prediction of Time-Dependent House Selling Probability","authors":"Mansurul Bhuiyan, M. Hasan","doi":"10.1109/DSAA.2016.58","DOIUrl":"https://doi.org/10.1109/DSAA.2016.58","url":null,"abstract":"Buying or selling a house is one of the important decisions in a person's life. Online listing websites like \"zillow.com\", \"trulia.com\", and \"realtor.com\" etc. provide significant and effective assistance during the buy/sell process. However, they fail to supply one important information of a house that is, approximately how long will it take for a house to be sold after it first appears in the listing? This information is equally important for both a potential buyer and the seller. With this information the seller will have an understanding of what she can do to expedite the sale, i.e. reduce the asking price, renovate/remodel some home features, etc. On the other hand, a potential buyer will have an idea of the available time for her to react i.e. to place an offer. In this work, we propose a supervised regression (Cox regression) model inspired by survival analysis to predict the sale probability of a house given historical home sale information within an observation time window. We use real-life housing data collected from \"trulia.com\" to validate the proposed prediction algorithm and show its superior performance over traditional regression methods. We also show how the sale probability of a house is influenced by the values of basic house features, such as price, size, # of bedrooms, # of bathrooms, and school quality.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132590488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiangfu Meng, Xiaoyan Zhang, Jinguang Sun, Lin Li, Changzheng Xing, Chongchun Bi
Spatial database queries are often exploratory, and users often find that their queries return too many answers, many of which may be irrelevant. Based on the coupling relationships between spatial objects, this paper proposes a novel categorization approach consisting of two steps. The first step analyzes the spatial object coupling relationship by considering the location proximity and semantic similarity between spatial objects, and then generates a set of clusters over the spatial objects, where each cluster represents one type of user need. When a user issues a spatial query, the second step presents the user with a category tree, generated by applying a modified C4.5 decision tree algorithm over the clusters, so that the user can easily select the subset of query results matching his/her needs by exploring the labels assigned to intermediate nodes of the tree. The experiments demonstrate that our spatial object clustering method can efficiently capture both the semantic and location correlations between spatial objects. The effectiveness and efficiency of the categorization algorithm are also demonstrated.
{"title":"A Decision Tree-Based Approach for Categorizing Spatial Database Query Results","authors":"Xiangfu Meng, Xiaoyan Zhang, Jinguang Sun, Lin Li, Changzheng Xing, Chongchun Bi","doi":"10.1109/DSAA.2016.50","DOIUrl":"https://doi.org/10.1109/DSAA.2016.50","url":null,"abstract":"Spatial database queries are often exploratory. The users often find that their queries return too many answers and many of them may be irrelevant. Based on the coupling relationships between spatial objects, this paper proposes a novel categorization approach which consists of two steps. The first step analyzes the spatial object coupling relationship by considering the location proximity and semantic similarity between spatial objects, and then a set of clusters over the spatial objects can be generated, where each cluster represents one type of user need. When a user issues a spatial query, the second step presents to the user a category tree which is generated by using modified C4.5 decision tree algorithm over the clusters such that the user can easily select the subset of query results matching his/her needs by exploring the labels assigned on intermediate nodes of the tree. The experiments demonstrate that our spatial object clustering method can efficiently capture both the semantic and location correlations between spatial objects. The effectiveness and efficiency of the categorization algorithm is also demonstrated.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127127376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The closest interval join problem is to find all the closest intervals between two interval sets R and S. Applications of the closest interval join include bioinformatics and other data science domains. Interval data can be very large and continues to increase in size due to advances in data acquisition technology. In this paper, we present efficient MapReduce algorithms to compute the closest interval join. Experiments based on both real and synthetic interval data demonstrate that our algorithms are efficient.
{"title":"Closest Interval Join Using MapReduce","authors":"Qiang Zhang, Andy He, Chris Liu, Eric Lo","doi":"10.1109/DSAA.2016.39","DOIUrl":"https://doi.org/10.1109/DSAA.2016.39","url":null,"abstract":"The closest interval join problem is to find all the closest intervals between two interval sets R and S. Applications of closest interval join include bioinformatics and other data science. Interval data can be very large and continue to increase in size due to the advancement of data acquisition technology. In this paper, we present efficient MapReduce algorithms to compute closest interval join. Experiments based on both real and synthetic interval data demonstrated that our algorithms are efficient.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129780136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we introduce "prediction engineering" as a formal step in the predictive modeling process. We define a generalizable three-part framework — Label, Segment, Featurize (L-S-F) — to address the growing demand for predictive models. The framework provides abstractions for data scientists to customize the process to unique prediction problems. We describe how to apply the L-S-F framework to characteristic problems in two domains and demonstrate an implementation over five unique prediction problems defined on a dataset of crowdfunding projects from DonorsChoose.org. The results demonstrate how the L-S-F framework complements existing tools, allowing us to rapidly build and evaluate 26 distinct predictive models. L-S-F enables the development of models that provide value to all parties involved (donors, teachers, and people running the platform).
{"title":"Label, Segment, Featurize: A Cross Domain Framework for Prediction Engineering","authors":"James Max Kanter, O. Gillespie, K. Veeramachaneni","doi":"10.1109/DSAA.2016.54","DOIUrl":"https://doi.org/10.1109/DSAA.2016.54","url":null,"abstract":"In this paper, we introduce \"prediction engineering\" as a formal step in the predictive modeling process. We define a generalizable 3 part framework — Label, Segment, Featurize (L-S-F) — to address the growing demand for predictive models. The framework provides abstractions for data scientists to customize the process to unique prediction problems. We describe how to apply the L-S-F framework to characteristic problems in 2 domains and demonstrate an implementation over 5 unique prediction problems defined on a dataset of crowdfunding projects from DonorsChoose.org. The results demonstrate how the L-S-F framework complements existing tools to allow us to rapidly build and evaluate 26 distinct predictive models. L-S-F enables development of models that provide value to all parties involved (donors, teachers, and people running the platform).","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121389870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Trey Grainger, Khalifeh AlJadda, M. Korayem, Andries Smith
This paper describes a new kind of knowledge representation and mining system which we call the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within the intersecting postings lists of multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, edges between any combination of nodes can be materialized and scored to reveal latent relationships between those nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data; new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations); and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendation systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) by dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain. The source code for our Semantic Knowledge Graph implementation is being published along with this paper to facilitate further research and extensions of this work.
{"title":"The Semantic Knowledge Graph: A Compact, Auto-Generated Model for Real-Time Traversal and Ranking of any Relationship within a Domain","authors":"Trey Grainger, Khalifeh AlJadda, M. Korayem, Andries Smith","doi":"10.1109/DSAA.2016.51","DOIUrl":"https://doi.org/10.1109/DSAA.2016.51","url":null,"abstract":"This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendations systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain. The source code for our Semantic Knowledge Graph implementation is being published along with this paper to facilitate further research and extensions of this work.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126771791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hossein Hosseini, Sreeram Kannan, Baosen Zhang, R. Poovendran
We consider the setting where a collection of time series, modeled as random processes, evolve in a causal manner, and one is interested in learning the graph governing the relationships of these processes. A special case of wide interest and applicability is the setting where the noise is Gaussian and the relationships are Markov and linear. We study this setting with two additional features: firstly, each random process has a hidden (latent) state, which we use to model the internal memory possessed by the variables (similar to hidden Markov models). Secondly, each variable can depend on its latent memory state through a random lag (rather than a fixed lag), thus modeling memory recall with differing lags at distinct times. Under this setting, we develop an estimator and prove that, under a genericity assumption, the parameters of the model can be learned consistently. We also propose a practical adaptation of this estimator, which demonstrates significant performance gains on both synthetic and real-world datasets.
{"title":"Learning Temporal Dependence from Time-Series Data with Latent Variables","authors":"Hossein Hosseini, Sreeram Kannan, Baosen Zhang, R. Poovendran","doi":"10.1109/DSAA.2016.34","DOIUrl":"https://doi.org/10.1109/DSAA.2016.34","url":null,"abstract":"We consider the setting where a collection of time series, modeled as random processes, evolve in a causal manner, and one is interested in learning the graph governing the relationships of these processes. A special case of wide interest and applicability is the setting where the noise is Gaussian and relationships are Markov and linear. We study this setting with two additional features: firstly, each random process has a hidden (latent) state, which we use to model the internal memory possessed by the variables (similar to hidden Markov models). Secondly, each variable can depend on its latent memory state through a random lag (rather than a fixed lag), thus modeling memory recall with differing lags at distinct times. Under this setting, we develop an estimator and prove that under a genericity assumption, the parameters of the model can be learned consistently. We also propose a practical adaption of this estimator, which demonstrates significant performance gains in both synthetic and real-world datasets.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129544363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The problem of limiting the diffusion of information in social networks has received substantial attention. To deal with the problem, existing works aim to prevent the diffusion of information to as many nodes as possible by deleting a given number of edges. Thus, they assume that the diffusing information can affect all nodes and that the deletion of each edge has the same impact on the information propagation properties of the graph. In this work, we propose an approach which lifts these limiting assumptions. Our approach allows specifying the nodes to which information should be prevented from diffusing, as well as their maximum allowable activation probability, and it performs edge deletion while avoiding drastic changes to the ability of the network to propagate information. To realize our approach, we propose a measure that captures the changes, caused by deletion, to the PageRank distribution of the graph. Based on this measure, we define the problem of finding an edge subset to delete as an optimization problem. We show that the problem can be modeled as a Submodular Set Cover (SSC) problem and design an approximation algorithm based on the well-known approximation algorithm for SSC. In addition, we develop an iterative heuristic that has similar effectiveness but is significantly more efficient than our algorithm. Experiments on real and synthetic data show the effectiveness and efficiency of our methods.
{"title":"Limiting the Diffusion of Information by a Selective PageRank-Preserving Approach","authors":"G. Loukides, Robert Gwadera","doi":"10.1109/DSAA.2016.16","DOIUrl":"https://doi.org/10.1109/DSAA.2016.16","url":null,"abstract":"The problem of limiting the diffusion of information in social networks has received substantial attention. To deal with the problem, existing works aim to prevent the diffusion of information to as many nodes as possible, by deleting a given number of edges. Thus, they assume that the diffusing information can affect all nodes and that the deletion of each edge has the same impact on the information propagation properties of the graph. In this work, we propose an approach which lifts these limiting assumptions. Our approach allows specifying the nodes to which information diffusion should be prevented and their maximum allowable activation probability, and it performs edge deletion while avoiding drastic changes to the ability of the network to propagate information. To realize our approach, we propose a measure that captures changes, caused by deletion, to the PageRank distribution of the graph. Based on the measure, we define the problem of finding an edge subset to delete as an optimization problem. We show that the problem can be modeled as a Submodular Set Cover (SSC) problem and design an approximation algorithm, based on the well-known approximation algorithm for SSC. In addition, we develop an iterative heuristic that has similar effectiveness but is significantly more efficient than our algorithm. Experiments on real and synthetic data show the effectiveness and efficiency of our methods.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127224401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}