We develop a nonparametric method to infer the drift and diffusion functions of a stochastic differential equation. With this method, we can build predictive models starting from repeated time series and/or high-dimensional longitudinal data. Typical uses of the method include forecasting the future density or distribution of the variable being measured. The key innovation in our method stems from efficient algorithms to evaluate a likelihood function and its gradient. These algorithms do not rely on sampling; instead, they use repeated quadrature and the adjoint method, enabling the inference to scale well as the dimensionality of the parameter vector grows. In simulated data tests, when the number of sample paths is large, the method does an excellent job of inferring drift functions close to the ground truth. We show that even when the method does not infer the drift function correctly, it still yields models with good predictive power. Finally, we apply the method to real data on hourly measurements of ground-level ozone and show that it produces reasonable results.
{"title":"Nonparametric Adjoint-Based Inference for Stochastic Differential Equations","authors":"H. Bhat, R. Madushani","doi":"10.1109/DSAA.2016.69","DOIUrl":"https://doi.org/10.1109/DSAA.2016.69","url":null,"abstract":"We develop a nonparametric method to infer the drift and diffusion functions of a stochastic differential equation. With this method, we can build predictive models starting with repeated time series and/or high-dimensional longitudinal data. Typical use of the method includes forecasting the future density or distribution of the variable being measured. The key innovation in our method stems from efficient algorithms to evaluate a likelihood function and its gradient. These algorithms do not rely on sampling, instead, they use repeated quadrature and the adjoint method to enable the inference to scale well as the dimensionality of the parameter vector grows. In simulated data tests, when the number of sample paths is large, the method does an excellent job of inferring drift functions close to the ground truth. We show that even when the method does not infer the drift function correctly, it still yields models with good predictive power. Finally, we apply the method to real data on hourly measurements of ground level ozone, showing that it is capable of reasonable results.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114146126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MapReduce has been widely used in many data science applications. It has been observed that excessive data transfer has a negative impact on its performance. To reduce the amount of data transfer, MapReduce exploits data locality. However, even though the majority of the processing cost occurs in the later stages, data locality has been exploited only in the early stages, which we call Shallow Data Locality (SDL). As a result, the benefit of data locality has not been fully realized. We have explored a new concept called Deep Data Locality (DDL), in which the data is pre-arranged to maximize locality in the later stages. Toward achieving stronger DDL, we introduce a new block placement paradigm called the Limited Node Block Placement Policy (LNBPP). Under the conventional default block placement policy (DBPP), data blocks are randomly placed on any available slave nodes, which requires copying RLM (Rack-Local Map) blocks between nodes. LNBPP, on the other hand, places the blocks so as to avoid RLMs, reducing the block copying time. Containers without RLMs have a more consistent execution time, and when assigned to individual cores on a multicore node, they collectively finish a job faster than containers under DBPP. LNBPP also rearranges the blocks onto a smaller number of nodes (hence Limited Node), which reduces the data transfer time between nodes. These strategies bring a significant performance improvement to Map and Shuffle. Our test results show that the execution times of Map and Shuffle improve by up to 33% and 44%, respectively. In this paper, we describe the MapReduce workflow in Hadoop with a simple computational model and introduce the current research directions in each step. We analyze the block placement status and RLM locations under DBPP using customer review data from TripAdvisor, and we measure performance by executing the TeraSort benchmark with various sizes of data. We then compare the performance of LNBPP with that of DBPP.
{"title":"Performance Improvement of MapReduce Process by Promoting Deep Data Locality","authors":"Sungchul Lee, Ju-Yeon Jo, Yoohwan Kim","doi":"10.1109/DSAA.2016.38","DOIUrl":"https://doi.org/10.1109/DSAA.2016.38","url":null,"abstract":"MapReduce has been widely used in many data science applications. It has been observed that an excessive data transfer has a negative impact on its performance. To reduce the amount of data transfer, MapReduce utilizes data locality. However, even though the majority of the processing cost occurs in the later stages, data locality has been utilized only in the early stages, which we call Shallow Data Locality (SDL). As a result, the benefit of data locality has not been fully realized. We have explored a new concept called Deep Data Locality (DDL) where the data is pre-arranged to maximize the locality in the later stages. Toward achieving stronger DDL, we introduce a new block placement paradigm called Limited Node Block Placement Policy (LNBPP). Under the conventional default block placement policy (DBPP), data blocks are randomly placed on any available slave nodes, requiring a copy of RLM (Rack-Local Map) blocks. On the other hand, LNBPP places the blocks in a way to avoid RLMs, reducing the block copying time. The containers without RLM have a more consistent execution time, and when assigned to individual cores on a multicore node, they finish a job faster collectively than the containers under DBPP. LNBPP also rearranges the blocks into a smaller number of nodes (hence Limited Node) and reduces the data transfer time between nodes. These strategies bring a significant performance improvement in Map and Shuffle. Our test result shows that the execution times of Map and Shuffle have been improved by up to 33% and 44% respectively. In this paper, we describe the MapReduce workflow in Hadoop with a simple computational model and introduce the current research directions in each step. We analyze the block placement status and RLM locations in DBPP with the customer review data from TripAdvisor and measure the performances by executing the Terasort Benchmark with various sizes of data. We then compare the performances of LNBPP with DBPP.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114623366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a novel method for classifying hashtag types. Specifically, we apply word segmentation algorithms and lexical resources to classify hashtags into two types: those that carry sentiment information and those that do not. The complex structure of hashtags makes sentiment information difficult to identify, so we segment each hashtag into smaller semantic units using word segmentation algorithms in conjunction with lexical resources and classify it on that basis. Our experimental results demonstrate that our approach achieves a 14% increase in accuracy over baseline methods for identifying hashtags with sentiment information. Additionally, we achieve over 94% recall when this hashtag type is used for the subjectivity detection of tweets.
{"title":"Word Segmentation Algorithms with Lexical Resources for Hashtag Classification","authors":"Credell Simeon, Howard J. Hamilton, Robert J. Hilderman","doi":"10.1109/DSAA.2016.80","DOIUrl":"https://doi.org/10.1109/DSAA.2016.80","url":null,"abstract":"We present a novel method for classifying hashtag types. Specifically, we apply word segmentation algorithms and lexical resources in order to classify two types of hashtags: those with sentiment information and those without. However, the complex structure of hashtags increases the difficulty of identifying sentiment information. In order to solve this problem, we segment hashtags into smaller semantic units using word segmentation algorithms in conjunction with lexical resources to classify hashtag types. Our experimental results demonstrate that our approach achieves a 14% increase in accuracy over baseline methods for identifying hashtags with sentiment information. Additionally, we achieve over 94% recall using this hashtag type for the subjectivity detection of tweets.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"83 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131770310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discrimination discovery and prevention have received intensive attention recently. Discrimination generally refers to an unjustified distinction between individuals based on their membership, or perceived membership, in a certain group, and it often occurs when that group is treated less favorably than others. However, existing discrimination discovery and prevention approaches are often limited to examining the relationship between one decision attribute and one protected attribute, and they do not sufficiently account for the effects of other non-protected attributes. In this paper we develop a single unifying framework that captures and measures discrimination between multiple decision attributes and protected attributes in the presence of a set of non-protected attributes. Our approach is based on loglinear modeling. The coefficient values of the fitted loglinear model provide quantitative evidence of discrimination in decision making. The conditional independence graph derived from the fitted graphical loglinear model can be used to detect discrimination patterns based on Markov properties. We further develop an algorithm to remove discrimination: the idea is to modify the significant coefficients of the fitted loglinear model and to use the modified model to generate new data. Our empirical evaluation shows the effectiveness of the proposed approach.
{"title":"Using Loglinear Model for Discrimination Discovery and Prevention","authors":"Yongkai Wu, Xintao Wu","doi":"10.1109/DSAA.2016.18","DOIUrl":"https://doi.org/10.1109/DSAA.2016.18","url":null,"abstract":"Discrimination discovery and prevention has received intensive attention recently. Discrimination generally refers to an unjustified distinction of individuals based on their membership, or perceived membership, in a certain group, and often occurs when the group is treated less favorably than others. However, existing discrimination discovery and prevention approaches are often limited to examining the relationship between one decision attribute and one protected attribute and do not sufficiently incorporate the effects due to other non-protected attributes. In this paper we develop a single unifying framework that aims to capture and measure discriminations between multiple decision attributes and protected attributes in addition to a set of non-protected attributes. Our approach is based on loglinear modeling. The coefficient values of the fitted loglinear model provide quantitative evidence of discrimination in decision making. The conditional independence graph derived from the fitted graphical loglinear model can be effectively used to capture the existence of discrimination patterns based on Markov properties. We further develop an algorithm to remove discrimination. The idea is modifying those significant coefficients from the fitted loglinear model and using the modified model to generate new data. Our empirical evaluation results show effectiveness of our proposed approach.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131884127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
BITCOIN is a novel decentralized cryptocurrency system that has recently received a great deal of attention from a wide audience. An interesting and unique feature of this system is that the complete list of all transactions that have occurred since its inception is publicly available. This makes it possible to investigate the movement of funds and uncover interesting properties of the BITCOIN economy. In this paper we present a set of analyses of the user graph, i.e. the graph obtained by a heuristic clustering of the graph of BITCOIN transactions. Our analyses consider an up-to-date BITCOIN blockchain, as of December 2015, after the exponential growth in the number of transactions over the last two years. The set of analyses we define includes, among others, the analysis of the time evolution of the BITCOIN network, the verification of the "rich get richer" conjecture, and the detection of the nodes that are critical for network connectivity.
{"title":"Uncovering the Bitcoin Blockchain: An Analysis of the Full Users Graph","authors":"D. Maesa, Andrea Marino, L. Ricci","doi":"10.1109/DSAA.2016.52","DOIUrl":"https://doi.org/10.1109/DSAA.2016.52","url":null,"abstract":"BITCOIN is a novel decentralized cryptocurrency system which has recently received a great attention from a wider audience. An interesting and unique feature of this system is that the complete list of all the transactions occurred from its inception is publicly available. This enables the investigation of funds movements to uncover interesting properties of the BITCOIN economy. In this paper we present a set of analyses of the user graph, i.e. the graph obtained by an heuristic clustering of the graph of BITCOIN transactions. Our analyses consider an up-to-date BITCOIN blockchain, as in December 2015, after the exponential explosion of the number of transactions occurred in the last two years. The set of analyses we defined includes, among others, the analysis of the time evolution of BITCOIN network, the verification of the \"rich get richer\" conjecture and the detection of the nodes which are critical for the network connectivity.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125448259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social networks enable knowledge sharing, which inevitably raises the question of expertise analysis. Many online profiles claim expertise, but true expertise is rare. We characterize expertise as projected expertise (what a person claims), perceived expertise (how the crowd perceives the individual), and true expertise (factual). StockTwits, an investor-focused microblogging platform, allows us to study all three aspects simultaneously. We analyze more than 18 million tweets spanning 1700 days. The large time scale also allows us to analyze expertise and its categories as they evolve over time, which is the first study of its kind on StockTwits. We propose a method to capture perceived expertise through how significantly a user's follower network grows and how often the user is brought up in conversations. We also quantify actual, market-based, true expertise based on the user's trade and investment recommendations. Finally, we provide an analysis of the differences between how users project themselves, how the crowd perceives them, and how they actually perform on the market. Our results show that users who project themselves as experts are the ones who talk the most and have the lowest recommendation-to-tweet ratio (that is, most of their conversations are mundane). The recommendations of users who project novice expertise slightly outperform (≈5%) the overall stock market. On the other hand, the trade recommendations of self-proclaimed experts yield 80% less than those of intermediate traders. Interestingly, users who are perceived as experts by others, as measured by network centrality, produced net negative returns over a four-year trading period. Our study also looks at the evolution of expertise and begins to examine why and what makes users change the way they project their own expertise. On this topic, however, the paper raises more questions than it answers, which will serve as the basis for future studies.
{"title":"Perceived, Projected, and True Investment Expertise: Not All Experts Provide Expert Recommendations","authors":"Amit Shavit, Sameena Shah","doi":"10.1109/DSAA.2016.45","DOIUrl":"https://doi.org/10.1109/DSAA.2016.45","url":null,"abstract":"Social networks enable knowledge sharing that inevitably begs the question of expertise analysis. Many online profiles claim expertise, but possessing true expertise is rare. We characterize expertise as projected expertise (claims of a person), perceived expertise (how the crowd perceives the individual) and true expertise (factual). StockTwits, an investor-focused microblogging platform, allows us to study all three aspects simultaneously. We analyze more than 18 million tweets spanning 1700 days. The large time scale allows us to also analyze expertise and its categories as they evolve over time, which is the first study of its kind on StockTwits. We propose a method to capture perceived expertise by how significantly a user's follower network grows and how often the user is brought up in conversations. We also quantify actual, market-based, true expertise based on the user's trade and investment recommendations. Finally we provide an analysis bringing out the differences between how users project themselves, how the crowd perceives them, and how they are actually performing on the market. Our results show that users who project themselves as experts are ones that talk the most and provide the least recommendation-to-tweet ratio (that is, most of their conversations are mundane). The recommendations from users who project novice expertise slightly outperform (≈5%) the overall stock market. On the other hand, the trade recommendations from self-proclaimed experts yield 80% less than those of intermediate traders. Interestingly, users who are perceived as experts by others, as measured by centrality measurements, resulted in net negative returns after a four year trading period. Our study also looks at the evolution of expertise, and begins to understand why and what makes users change the way they project their own expertise. For this topic, however, this paper introduces more questions than it answers, which will serve as the basis for future studies.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129895733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Crop production problems are common in Sri Lanka and severely affect rural farmers, the agriculture sector, and the country's economy as a whole. A deeper analysis revealed that the root cause is that farmers and other stakeholders in the domain do not receive the right information at the right time in the right format. Inspired by the rapid growth of mobile phone usage among farmers, we sought a mobile-based solution to overcome this information gap. Farmers need published (quasi-static) information about crops, pests, diseases, land preparation, and growing and harvesting methods, as well as real-time situational (dynamic) information such as current crop production and market prices. This situational information is also needed by the agriculture department, agro-chemical companies, buyers, and various government agencies to ensure food security through effective supply chain planning while minimising waste. We developed the notion of context-specific actionable information, which enables users to act with the least amount of further processing. A user-centered agriculture ontology was developed to convert published quasi-static information into actionable information. We adopted empowerment theory to create empowerment-oriented farming processes that motivate farmers to act on this information, and we aggregated the resulting transaction data to generate situational information. This created a holistic information flow model for the agriculture domain, similar to energy flow in biological ecosystems. Consequently, the initial mobile-based information system evolved into a Digital Knowledge Ecosystem that can predict the current production situation in near real time, enabling government agencies to dynamically adjust the incentives offered to farmers for growing different types of crops and thereby achieve sustainable agriculture production through crop diversification.
{"title":"Digital Knowledge Ecosystem for Achieving Sustainable Agriculture Production: A Case Study from Sri Lanka","authors":"A. Ginige, A. Walisadeera, T. Ginige, Lasanthi N. C. De Silva, P. D. Giovanni, M. Mathai, J. Goonetillake, G. Wikramanayake, G. Vitiello, M. Sebillo, G. Tortora, Deborah Richards, R. Jain","doi":"10.1109/DSAA.2016.82","DOIUrl":"https://doi.org/10.1109/DSAA.2016.82","url":null,"abstract":"Crop production problems are common in Sri Lanka which severely effect rural farmers, agriculture sector and the country's economy as a whole. A deeper analysis revealed that the root cause was farmers and other stakeholders in the domain not receiving right information at the right time in the right format. Inspired by the rapid growth of mobile phone usage among farmers a mobile-based solution is sought to overcome this information gap. Farmers needed published information (quasi static) about crops, pests, diseases, land preparation, growing and harvesting methods and real-time situational information (dynamic) such as current crop production and market prices. This situational information is also needed by agriculture department, agro-chemical companies, buyers and various government agencies to ensure food security through effective supply chain planning whilst minimising waste. We developed a notion of context specific actionable information which enables user to act with least amount of further processing. User centered agriculture ontology was developed to convert published quasi static information to actionable information. We adopted empowerment theory to create empowerment-oriented farming processes to motivate farmers to act on this information and aggregated the transaction data to generate situational information. This created a holistic information flow model for agriculture domain similar to energy flow in biological ecosystems. Consequently, the initial Mobile-based Information System evolved into a Digital Knowledge Ecosystem that can predict current production situation in near real enabling government agencies to dynamically adjust the incentives offered to farmers for growing different types of crops to achieve sustainable agriculture production through crop diversification.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126965436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing data mining techniques, particularly iterative learning algorithms, become overwhelmed by big data. While parallelism is an obvious and usually necessary strategy, we observe that (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms such as the Expectation Maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a non-linear hierarchical data structure (a heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow the iteration toward the data that is more difficult to cluster. We call this extended EM-T algorithm EM*. We show that EM* outperforms EM-T on large real-world and synthetic data sets. We conclude with some theoretical underpinnings that explain why EM* is successful.
{"title":"EM*: An EM Algorithm for Big Data","authors":"H. Kurban, Mark Jenne, Mehmet M. Dalkilic","doi":"10.1109/DSAA.2016.40","DOIUrl":"https://doi.org/10.1109/DSAA.2016.40","url":null,"abstract":"Existing data mining techniques, more particularly iterative learning algorithms, become overwhelmed with big data. While parallelism is an obvious and, usually, necessary strategy, we observe that both (1) continually revisiting data and (2) visiting all data are two of the most prominent problems especially for iterative, unsupervised algorithms like Expectation Maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a non-linear hierarchical data structure(heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow the iteration toward the data that is more difficult to cluster. We call this extended EM-T, EM*. We show our EM* algorithm outperform EM-T algorithm over large real world and synthetic data sets. We lastly conclude with some theoretic underpinnings that explain why EM* is successful.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132410734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Actitracker is a smartphone-based activity-monitoring service that helps people ensure they get enough physical activity to maintain good health. This free service allowed people to set personal activity goals and monitor their progress toward those goals. Actitracker uses machine learning methods to recognize a user's activities. It initially employs a "universal" model generated from labeled activity data collected from a panel of users, but it automatically shifts to a much more accurate personalized model once a user completes a simple training phase. Detailed activity reports and statistics are maintained and provided to the user. Actitracker is a research-based system that began in 2011, before fitness trackers like Fitbit were popular, and it was deployed for public use from 2012 until 2015, during which period it had 1,000 registered users. This paper describes the Actitracker system, its use of machine learning, and user experiences. While activity recognition has now entered the mainstream, this paper provides insights into applied activity recognition that commercial companies rarely share.
{"title":"Actitracker: A Smartphone-Based Activity Recognition System for Improving Health and Well-Being","authors":"Gary M. Weiss, J. W. Lockhart, T. Pulickal, Paul T. McHugh, Isaac H. Ronan, Jessica L. Timko","doi":"10.1109/DSAA.2016.89","DOIUrl":"https://doi.org/10.1109/DSAA.2016.89","url":null,"abstract":"Actitracker is a smartphone-based activity-monitoring service to help people ensure they receive sufficient activity to maintain proper health. This free service allowed people to set personal activity goals and monitor their progress toward these goals. Actitracker uses machine learning methods to recognize a user's activities. It initially employs a \"universal\" model generated from labeled activity data from a panel of users, but will automatically shift to a much more accurate personalized model once a user completes a simple training phase. Detailed activity reports and statistics are maintained and provided to the user. Actitracker is a research-based system that began in 2011, before fitness trackers like Fitbit were popular, and was deployed for public use from 2012 until 2015, during which period it had 1,000 registered users. This paper describes the Actitracker system, its use of machine learning, and user experiences. While activity recognition has now entered the mainstream, this paper provides insights into applied activity recognition, something that commercial companies rarely share.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132830935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online experiments are frequently used at internet companies to evaluate the impact of new designs, features, or code changes on user behavior. Though the experiment design is straightforward in theory, in practice many problems can complicate the interpretation of results and render conclusions about changes in user behavior invalid. Many of these problems are difficult to detect and often go unnoticed. Acknowledging and diagnosing these issues can prevent experiment owners from making decisions based on fundamentally flawed data. When conducting online experiments, data quality assurance is a top priority before any impact is attributed to changes in user behavior. While some problems can be detected by running AA tests before introducing the treatment, many do not emerge during the AA period and appear only during the AB period. Prior work on this topic has not addressed troubleshooting during the AB period. In this paper, we present lessons learned from experiments on various internet consumer products at Yahoo, along with diagnostic and remedy procedures. Most of the examples and troubleshooting procedures presented here apply to online experimentation at other companies as well. Some, such as traffic-splitting problems and outlier problems, have been documented before, but others have not previously been described in the literature.
{"title":"Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation","authors":"Zhenyu Zhao, Miao Chen, Don Matheson, M. Stone","doi":"10.1109/DSAA.2016.61","DOIUrl":"https://doi.org/10.1109/DSAA.2016.61","url":null,"abstract":"Online experiments are frequently used at internet companies to evaluate the impact of new designs, features, or code changes on user behavior. Though the experiment design is straightforward in theory, in practice, there are many problems that can complicate the interpretation of results and render any conclusions about changes in user behavior invalid. Many of these problems are difficult to detect and often go unnoticed. Acknowledging and diagnosing these issues can prevent experiment owners from making decisions based on fundamentally flawed data. When conducting online experiments, data quality assurance is a top priority before attributing the impact to changes in user behavior. While some problems can be detected by running AA tests before introducing the treatment, many problems do not emerge during the AA period, and appear only during the AB period. Prior work on this topic has not addressed troubleshooting during the AB period. In this paper, we present lessons learned from experiments on various internet consumer products at Yahoo, as well as diagnostic and remedy procedures. Most of the examples and troubleshooting procedures presented here are generic to online experimentation at other companies. Some, such as traffic splitting problems and outlier problems have been documented before, but others have not previously been described in the literature.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133356084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}