
Latest Publications: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)

Connecting Opinions to Opinion-Leaders: A Case Study on Brazilian Political Protests
L. Rocha, Fernando Mourão, Ramon Vieira, A. Neves, D. Carvalho, Bortik Bandyopadhyay, S. Parthasarathy, R. Ferreira
Social media applications have assumed an important role in the decision-making process of users, affecting their choices about products and services. In this context, understanding and modeling opinions, as well as opinion-leaders, has implications for several tasks, such as recommendation, advertising, and brand evaluation. Despite the intrinsic relation between opinions and opinion-leaders, most recent works focus exclusively on either understanding opinions, through Sentiment Analysis (SA) proposals, or identifying opinion-leaders, through Influential Users Detection (IUD). This paper presents a preliminary evaluation of a combined analysis of SA and IUD. To this end, we propose a methodology to quantify factors in real domains that may affect such analysis, as well as the potential benefits of combining SA methods with IUD ones. Empirical assessments on a sample of tweets about the Brazilian president reveal that the collective opinion and the set of top opinion-leaders over time are inter-related. Further, we were able to identify distinct characteristics of opinion propagation, and we found that the collective opinion may be accurately estimated by using a few top-k opinion-leaders. These results point to the combined analysis of SA and IUD as a promising research direction to be further exploited.
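As a rough illustration of the final observation (estimating the collective opinion from only the top-k opinion-leaders), here is a minimal sketch; the data, the influence scores, and the simple averaging are all hypothetical stand-ins, not the authors' pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tweets = pd.DataFrame({
    "user": rng.integers(0, 100, size=5000),         # 100 hypothetical users
    "sentiment": rng.uniform(-1, 1, size=5000),      # SA output in [-1, 1]
})
# Hypothetical per-user influence score (follower count, PageRank, ...)
influence = pd.Series(rng.pareto(2.0, size=100))

collective = tweets["sentiment"].mean()

k = 10
top_k_users = influence.nlargest(k).index
top_k_estimate = tweets.loc[tweets["user"].isin(top_k_users), "sentiment"].mean()

print(f"collective opinion:     {collective:+.3f}")
print(f"top-{k} leader estimate: {top_k_estimate:+.3f}")
```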
Citations: 3
Disease Detection and Severity Estimation in Cotton Plant from Unconstrained Images
Aditya Parikh, M. Raval, Chandrasinh Parmar, S. Chaudhary
The primary focus of this paper is to detect disease and estimate its stage for a cotton plant using images. Most disease symptoms are reflected on the cotton leaf. Unlike earlier approaches, the novelty of the proposal lies in processing images captured under uncontrolled conditions in the field, using a normal or mobile phone camera, by an untrained person. Such field images have a cluttered background, making leaf segmentation very challenging. The proposed work uses two cascaded classifiers. Using local statistical features, the first classifier segments the leaf from the background. Then, using hue and luminance from the HSV colour space, another classifier is trained to detect disease and find its stage. The developed algorithm is generalised, as it can be applied to any disease. As a showcase, however, we detect Grey Mildew, a fungal disease widely prevalent in North Gujarat, India.
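A minimal sketch of the two-stage cascade described above, using scikit-learn random forests over synthetic stand-in features (the abstract does not name the classifiers, and the features below are placeholders, not the authors' trained models):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stage 1: leaf vs. background from local statistical features of each pixel
X_seg = rng.normal(size=(1000, 4))                         # e.g. local mean/std
y_seg = (X_seg[:, 0] + 0.5 * X_seg[:, 1] > 0).astype(int)  # 1 = leaf pixel
leaf_clf = RandomForestClassifier(n_estimators=50, random_state=0)
leaf_clf.fit(X_seg, y_seg)

# Stage 2: disease stage from hue and luminance of segmented leaf regions
X_hv = rng.uniform(size=(500, 2))                          # (hue, luminance)
y_stage = np.digitize(X_hv[:, 0], [0.33, 0.66])            # 3 toy stages: 0/1/2
stage_clf = RandomForestClassifier(n_estimators=50, random_state=0)
stage_clf.fit(X_hv, y_stage)

# Cascade: only pixels classified as leaf reach the disease-stage model
new_pixels = rng.normal(size=(10, 4))
is_leaf = leaf_clf.predict(new_pixels).astype(bool)
leaf_hv = rng.uniform(size=(int(is_leaf.sum()), 2))        # stand-in HV features
print(stage_clf.predict(leaf_hv))
```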
Citations: 52
Mining Research Problems from Scientific Literature
Chanakya Aalla, Vikram Pudi
Extracting structured information from unstructured text is a critical problem. Over the past few years, various clustering algorithms have been proposed to solve this problem. In addition, various algorithms based on probabilistic topic models have been developed to find the hidden thematic structure of various corpora (e.g., publications, blogs). Both types of algorithms have been transferred to the domain of scientific literature to extract structured information for problems like data exploration and expert detection. In order to remain domain-agnostic, these algorithms do not exploit the structure present in a scientific publication. The majority of researchers interpret a scientific publication as research conducted to report progress in solving some research problem. Following this interpretation, in this paper we present a different outlook on the same problem by modelling scientific publications around research problems. By associating a scientific publication with a research problem, exploring the scientific literature becomes more intuitive. We propose an unsupervised framework to mine research problems from the titles and abstracts of scientific literature. Our framework uses weighted frequent phrase mining to generate phrases and filters them to obtain high-quality phrases. These high-quality phrases are then used to segment the scientific publication into meaningful semantic units. After segmenting publications, we apply a number of heuristics to score the phrases and sentences and thereby identify the research problems. In a postprocessing step, we use a neighborhood-based algorithm to merge different representations of the same problems. Experiments conducted on parts of the DBLP dataset show promising results.
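One plausible reading of the frequent-phrase step, sketched below with plain bigram counting (the paper's weighted mining and filtering are richer than this toy version):

```python
from collections import Counter
import re

docs = [
    "mining research problems from scientific literature",
    "frequent phrase mining for research problems",
    "weighted frequent phrase mining in large corpora",
]

def bigrams(text):
    """Yield adjacent word pairs as candidate phrases."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return zip(tokens, tokens[1:])

counts = Counter(bg for d in docs for bg in bigrams(d))
min_support = 2
phrases = [" ".join(bg) for bg, c in counts.items() if c >= min_support]
print(phrases)   # e.g. ['research problems', 'frequent phrase', 'phrase mining']
```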
Citations: 3
Waiting to Be Sold: Prediction of Time-Dependent House Selling Probability
Mansurul Bhuiyan, M. Hasan
Buying or selling a house is one of the important decisions in a person's life. Online listing websites like "zillow.com", "trulia.com", and "realtor.com" provide significant and effective assistance during the buy/sell process. However, they fail to supply one important piece of information about a house: approximately how long will it take for the house to be sold after it first appears in the listing? This information is equally important for both a potential buyer and the seller. With it, the seller will have an understanding of what she can do to expedite the sale, e.g., reduce the asking price or renovate/remodel some home features. On the other hand, a potential buyer will have an idea of the time available for her to react, i.e., to place an offer. In this work, we propose a supervised regression (Cox regression) model inspired by survival analysis to predict the sale probability of a house, given historical home sale information within an observation time window. We use real-life housing data collected from "trulia.com" to validate the proposed prediction algorithm and show its superior performance over traditional regression methods. We also show how the sale probability of a house is influenced by the values of basic house features, such as price, size, # of bedrooms, # of bathrooms, and school quality.
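A minimal sketch of a time-to-sale Cox model of the kind described above, using the lifelines library (not the authors' code); the toy data, feature names, and the small ridge penalty are illustrative assumptions.

```python
import pandas as pd
from lifelines import CoxPHFitter

listings = pd.DataFrame({
    "days_on_market": [30, 90, 45, 120, 60, 15],
    "sold":           [1,  0,  1,  1,   0,  1],   # 0 = still listed (censored)
    "price_k":        [250, 400, 300, 500, 350, 200],
    "bedrooms":       [3, 4, 2, 5, 3, 2],
})

# Small penalizer keeps the fit stable on this tiny toy dataset
cph = CoxPHFitter(penalizer=0.1)
cph.fit(listings, duration_col="days_on_market", event_col="sold")

# S(t) is the probability the listing is still unsold at day t, so
# 1 - S(t) is the time-dependent probability of having sold by day t.
survival = cph.predict_survival_function(listings.iloc[[0]])
print(survival.head())
```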
Citations: 7
A Decision Tree-Based Approach for Categorizing Spatial Database Query Results
Xiangfu Meng, Xiaoyan Zhang, Jinguang Sun, Lin Li, Changzheng Xing, Chongchun Bi
Spatial database queries are often exploratory. Users often find that their queries return too many answers, many of which may be irrelevant. Based on the coupling relationships between spatial objects, this paper proposes a novel categorization approach consisting of two steps. The first step analyzes the spatial object coupling relationships by considering the location proximity and semantic similarity between spatial objects; a set of clusters over the spatial objects can then be generated, where each cluster represents one type of user need. When a user issues a spatial query, the second step presents the user with a category tree, generated by running a modified C4.5 decision tree algorithm over the clusters, so that the user can easily select the subset of query results matching his/her needs by exploring the labels assigned to intermediate nodes of the tree. The experiments demonstrate that our spatial object clustering method can efficiently capture both the semantic and location correlations between spatial objects. The effectiveness and efficiency of the categorization algorithm are also demonstrated.
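As an illustration of the category-tree step, the sketch below trains an entropy-criterion decision tree with scikit-learn (a stand-in for C4.5, which scikit-learn does not implement exactly); the clusters and features are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Features of returned spatial objects: (distance_km, semantic_score)
X = rng.uniform(size=(200, 2))
# Cluster id from the first step (location + semantic clustering)
y = (X[:, 0] > 0.5).astype(int) + 2 * (X[:, 1] > 0.5).astype(int)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# The tree's internal-node tests are the labels a user browses to
# narrow the query results down to one cluster of interest.
print(export_text(tree, feature_names=["distance_km", "semantic_score"]))
```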
Citations: 0
Closest Interval Join Using MapReduce
Qiang Zhang, Andy He, Chris Liu, Eric Lo
The closest interval join problem is to find all the closest intervals between two interval sets R and S. Applications of the closest interval join arise in bioinformatics and other areas of data science. Interval data can be very large and continues to increase in size due to advances in data acquisition technology. In this paper, we present efficient MapReduce algorithms to compute the closest interval join. Experiments based on both real and synthetic interval data demonstrate that our algorithms are efficient.
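To make the problem statement concrete, here is a brute-force, single-node sketch of a closest interval join in plain Python (a stand-in for the problem definition, not a rendering of the authors' MapReduce algorithms):

```python
def gap(a, b):
    """Distance between intervals a=(start, end) and b; 0 if they overlap."""
    return max(a[0] - b[1], b[0] - a[1], 0)

def closest_join(R, S):
    # For each interval in R, find an interval in S minimizing the gap
    for r in R:
        best = min(S, key=lambda s: gap(r, s))
        yield r, best, gap(r, best)

R = [(1, 3), (10, 12)]
S = [(4, 5), (8, 9), (13, 20)]
for r, s, d in closest_join(R, S):
    print(r, "->", s, "gap =", d)
# (1, 3)  -> (4, 5) gap = 1
# (10, 12) -> (8, 9) gap = 1   # (13, 20) ties; min keeps the first found
```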
Citations: 4
Label, Segment, Featurize: A Cross Domain Framework for Prediction Engineering
James Max Kanter, O. Gillespie, K. Veeramachaneni
In this paper, we introduce "prediction engineering" as a formal step in the predictive modeling process. We define a generalizable 3-part framework — Label, Segment, Featurize (L-S-F) — to address the growing demand for predictive models. The framework provides abstractions that let data scientists customize the process to unique prediction problems. We describe how to apply the L-S-F framework to characteristic problems in 2 domains and demonstrate an implementation over 5 unique prediction problems defined on a dataset of crowdfunding projects from DonorsChoose.org. The results demonstrate how the L-S-F framework complements existing tools, allowing us to rapidly build and evaluate 26 distinct predictive models. L-S-F enables the development of models that provide value to all parties involved (donors, teachers, and people running the platform).
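A minimal sketch of how the three stages might look on a toy donation log follows; the column names and cutoff logic are our own illustrative assumptions, not the released implementation.

```python
import pandas as pd

events = pd.DataFrame({
    "project_id": [1, 1, 1, 2, 2],
    "time":       pd.to_datetime(["2016-01-01", "2016-01-05", "2016-02-01",
                                  "2016-01-02", "2016-01-03"]),
    "donation":   [10.0, 25.0, 5.0, 50.0, 15.0],
})

cutoff = pd.Timestamp("2016-01-10")

# Label: the outcome after the cutoff (here: any donation after the cutoff)
label = events[events.time > cutoff].groupby("project_id").size().gt(0)

# Segment: keep only data observable before the cutoff
segment = events[events.time <= cutoff]

# Featurize: aggregate each project's segment into one feature row
features = segment.groupby("project_id")["donation"].agg(["count", "sum", "mean"])

training = features.join(label.rename("label")).fillna({"label": False})
print(training)
```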
Citations: 20
The Semantic Knowledge Graph: A Compact, Auto-Generated Model for Real-Time Traversal and Ranking of any Relationship within a Domain
Trey Grainger, Khalifeh AlJadda, M. Korayem, Andries Smith
This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, edges between any combination of nodes can be materialized and scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data; new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations); and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendation systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) by dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain. The source code for our Semantic Knowledge Graph implementation is being published along with this paper to facilitate further research and extensions of this work.
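A minimal sketch of the core mechanism (an inverted index whose postings-list intersections materialize and score edges on demand) appears below; the overlap-minus-expectation score is one simple choice, not the paper's exact ranking function.

```python
from collections import defaultdict

docs = {
    0: "java developer spring hibernate",
    1: "java developer jvm scala",
    2: "registered nurse hospital",
    3: "java jvm scala",
}

index = defaultdict(set)                 # inverted index: term -> doc ids
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def edge_score(a, b):
    """Score an edge by how far the postings overlap of a and b exceeds
    what two independent terms would give (one simple choice of score)."""
    n = len(docs)
    pa, pb = index[a], index[b]
    observed = len(pa & pb) / n
    expected = (len(pa) / n) * (len(pb) / n)
    return observed - expected

print(edge_score("java", "jvm"))     # related terms: positive
print(edge_score("java", "nurse"))   # unrelated terms: negative
```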
Citations: 19
Learning Temporal Dependence from Time-Series Data with Latent Variables
Hossein Hosseini, Sreeram Kannan, Baosen Zhang, R. Poovendran
We consider a setting where a collection of time series, modeled as random processes, evolves in a causal manner, and one is interested in learning the graph governing the relationships among these processes. A special case of wide interest and applicability is the setting where the noise is Gaussian and the relationships are Markov and linear. We study this setting with two additional features: first, each random process has a hidden (latent) state, which we use to model the internal memory possessed by the variables (similar to hidden Markov models); second, each variable can depend on its latent memory state through a random lag (rather than a fixed lag), thus modeling memory recall with differing lags at distinct times. Under this setting, we develop an estimator and prove that, under a genericity assumption, the parameters of the model can be learned consistently. We also propose a practical adaptation of this estimator, which demonstrates significant performance gains on both synthetic and real-world datasets.
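As a concrete reading of this setting, a linear-Gaussian latent-state model with a random recall lag might be written as follows (a plausible formulation consistent with the abstract, not necessarily the authors' exact model):

```latex
% A linear-Gaussian latent-state model with a random recall lag \ell_t;
% the lag distribution and dimensions are illustrative assumptions.
\begin{aligned}
  z_t &= A\, z_{t-1} + w_t,       & w_t &\sim \mathcal{N}(0, Q)
        && \text{(latent memory state)} \\
  x_t &= C\, z_{t-\ell_t} + v_t,  & v_t &\sim \mathcal{N}(0, R)
        && \text{(observation recalls memory with random lag } \ell_t\text{)}
\end{aligned}
```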
Citations: 4
Limiting the Diffusion of Information by a Selective PageRank-Preserving Approach
G. Loukides, Robert Gwadera
The problem of limiting the diffusion of information in social networks has received substantial attention. To deal with the problem, existing works aim to prevent the diffusion of information to as many nodes as possible by deleting a given number of edges. They thus assume that the diffusing information can affect all nodes and that the deletion of each edge has the same impact on the information propagation properties of the graph. In this work, we propose an approach that lifts these limiting assumptions. Our approach allows specifying the nodes to which information diffusion should be prevented, together with their maximum allowable activation probability, and it performs edge deletion while avoiding drastic changes to the network's ability to propagate information. To realize our approach, we propose a measure that captures the changes, caused by deletion, to the PageRank distribution of the graph. Based on this measure, we define the problem of finding an edge subset to delete as an optimization problem. We show that the problem can be modeled as a Submodular Set Cover (SSC) problem and design an approximation algorithm based on the well-known approximation algorithm for SSC. In addition, we develop an iterative heuristic that has similar effectiveness but is significantly more efficient than our algorithm. Experiments on real and synthetic data show the effectiveness and efficiency of our methods.
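For context, the classical greedy algorithm for Submodular Set Cover, which the paper's approximation algorithm builds on, is sketched below with a toy coverage function (the paper's actual objective is PageRank-based, not this toy):

```python
def greedy_ssc(candidates, coverage, target):
    """Classical greedy for Submodular Set Cover: repeatedly pick the
    element (here: an edge to delete) with the largest marginal gain in
    the monotone submodular coverage function until `target` is reached."""
    chosen = set()
    value = coverage(chosen)
    while value < target:
        remaining = candidates - chosen
        if not remaining:
            break
        gains = {e: coverage(chosen | {e}) - value for e in remaining}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:             # no remaining element helps
            break
        chosen.add(best)
        value += gains[best]
    return chosen

# Toy coverage: each candidate edge "protects" some nodes from activation;
# require at least 4 protected nodes.
protects = {"e1": {1, 2}, "e2": {2, 3}, "e3": {4, 5}}

def cover(S):
    return len(set().union(*(protects[e] for e in S))) if S else 0

print(greedy_ssc(set(protects), cover, target=4))   # e.g. {'e1', 'e3'}
```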
Citations: 1