We develop a nonparametric method to infer the drift and diffusion functions of a stochastic differential equation. With this method, we can build predictive models starting from repeated time series and/or high-dimensional longitudinal data. Typical uses of the method include forecasting the future density or distribution of the variable being measured. The key innovation in our method stems from efficient algorithms to evaluate a likelihood function and its gradient. These algorithms do not rely on sampling; instead, they use repeated quadrature and the adjoint method, enabling the inference to scale well as the dimensionality of the parameter vector grows. In simulated data tests, when the number of sample paths is large, the method does an excellent job of inferring drift functions close to the ground truth. We show that even when the method does not infer the drift function correctly, it still yields models with good predictive power. Finally, we apply the method to real data on hourly measurements of ground-level ozone and show that it produces reasonable results.
{"title":"Nonparametric Adjoint-Based Inference for Stochastic Differential Equations","authors":"H. Bhat, R. Madushani","doi":"10.1109/DSAA.2016.69","DOIUrl":"https://doi.org/10.1109/DSAA.2016.69","url":null,"abstract":"We develop a nonparametric method to infer the drift and diffusion functions of a stochastic differential equation. With this method, we can build predictive models starting with repeated time series and/or high-dimensional longitudinal data. Typical use of the method includes forecasting the future density or distribution of the variable being measured. The key innovation in our method stems from efficient algorithms to evaluate a likelihood function and its gradient. These algorithms do not rely on sampling, instead, they use repeated quadrature and the adjoint method to enable the inference to scale well as the dimensionality of the parameter vector grows. In simulated data tests, when the number of sample paths is large, the method does an excellent job of inferring drift functions close to the ground truth. We show that even when the method does not infer the drift function correctly, it still yields models with good predictive power. Finally, we apply the method to real data on hourly measurements of ground level ozone, showing that it is capable of reasonable results.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114146126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MapReduce has been widely used in many data science applications. It has been observed that excessive data transfer has a negative impact on its performance. To reduce the amount of data transfer, MapReduce exploits data locality. However, even though the majority of the processing cost occurs in the later stages, data locality has been exploited only in the early stages, which we call Shallow Data Locality (SDL). As a result, the benefit of data locality has not been fully realized. We have explored a new concept called Deep Data Locality (DDL), in which the data is pre-arranged to maximize locality in the later stages. Toward achieving stronger DDL, we introduce a new block placement paradigm called the Limited Node Block Placement Policy (LNBPP). Under the conventional default block placement policy (DBPP), data blocks are randomly placed on any available slave nodes, which requires copying RLM (Rack-Local Map) blocks between nodes. LNBPP, on the other hand, places the blocks so as to avoid RLMs, reducing the block copying time. Containers without RLMs have a more consistent execution time, and when assigned to individual cores on a multicore node, they collectively finish a job faster than containers under DBPP. LNBPP also rearranges the blocks onto a smaller number of nodes (hence Limited Node), which reduces the data transfer time between nodes. These strategies bring a significant performance improvement to Map and Shuffle. Our test results show that the execution times of Map and Shuffle improve by up to 33% and 44%, respectively. In this paper, we describe the MapReduce workflow in Hadoop with a simple computational model and introduce the current research directions in each step. We analyze the block placement status and RLM locations under DBPP using customer review data from TripAdvisor, and we measure performance by executing the TeraSort benchmark with various sizes of data. We then compare the performance of LNBPP with that of DBPP.
{"title":"Performance Improvement of MapReduce Process by Promoting Deep Data Locality","authors":"Sungchul Lee, Ju-Yeon Jo, Yoohwan Kim","doi":"10.1109/DSAA.2016.38","DOIUrl":"https://doi.org/10.1109/DSAA.2016.38","url":null,"abstract":"MapReduce has been widely used in many data science applications. It has been observed that an excessive data transfer has a negative impact on its performance. To reduce the amount of data transfer, MapReduce utilizes data locality. However, even though the majority of the processing cost occurs in the later stages, data locality has been utilized only in the early stages, which we call Shallow Data Locality (SDL). As a result, the benefit of data locality has not been fully realized. We have explored a new concept called Deep Data Locality (DDL) where the data is pre-arranged to maximize the locality in the later stages. Toward achieving stronger DDL, we introduce a new block placement paradigm called Limited Node Block Placement Policy (LNBPP). Under the conventional default block placement policy (DBPP), data blocks are randomly placed on any available slave nodes, requiring a copy of RLM (Rack-Local Map) blocks. On the other hand, LNBPP places the blocks in a way to avoid RLMs, reducing the block copying time. The containers without RLM have a more consistent execution time, and when assigned to individual cores on a multicore node, they finish a job faster collectively than the containers under DBPP. LNBPP also rearranges the blocks into a smaller number of nodes (hence Limited Node) and reduces the data transfer time between nodes. These strategies bring a significant performance improvement in Map and Shuffle. Our test result shows that the execution times of Map and Shuffle have been improved by up to 33% and 44% respectively. In this paper, we describe the MapReduce workflow in Hadoop with a simple computational model and introduce the current research directions in each step. We analyze the block placement status and RLM locations in DBPP with the customer review data from TripAdvisor and measure the performances by executing the Terasort Benchmark with various sizes of data. We then compare the performances of LNBPP with DBPP.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114623366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a novel method for classifying hashtag types. Specifically, we apply word segmentation algorithms and lexical resources to classify hashtags into two types: those that carry sentiment information and those that do not. The complex structure of hashtags makes sentiment information difficult to identify, so we segment each hashtag into smaller semantic units using word segmentation algorithms in conjunction with lexical resources and classify it on that basis. Our experimental results demonstrate that our approach achieves a 14% increase in accuracy over baseline methods for identifying hashtags with sentiment information. Additionally, we achieve over 94% recall when this hashtag type is used for the subjectivity detection of tweets.
{"title":"Word Segmentation Algorithms with Lexical Resources for Hashtag Classification","authors":"Credell Simeon, Howard J. Hamilton, Robert J. Hilderman","doi":"10.1109/DSAA.2016.80","DOIUrl":"https://doi.org/10.1109/DSAA.2016.80","url":null,"abstract":"We present a novel method for classifying hashtag types. Specifically, we apply word segmentation algorithms and lexical resources in order to classify two types of hashtags: those with sentiment information and those without. However, the complex structure of hashtags increases the difficulty of identifying sentiment information. In order to solve this problem, we segment hashtags into smaller semantic units using word segmentation algorithms in conjunction with lexical resources to classify hashtag types. Our experimental results demonstrate that our approach achieves a 14% increase in accuracy over baseline methods for identifying hashtags with sentiment information. Additionally, we achieve over 94% recall using this hashtag type for the subjectivity detection of tweets.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"83 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131770310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discrimination discovery and prevention have received intensive attention recently. Discrimination generally refers to an unjustified distinction between individuals based on their membership, or perceived membership, in a certain group, and it often occurs when that group is treated less favorably than others. However, existing discrimination discovery and prevention approaches are often limited to examining the relationship between one decision attribute and one protected attribute, and they do not sufficiently account for the effects of other non-protected attributes. In this paper we develop a single unifying framework that captures and measures discrimination between multiple decision attributes and protected attributes in the presence of a set of non-protected attributes. Our approach is based on loglinear modeling. The coefficient values of the fitted loglinear model provide quantitative evidence of discrimination in decision making. The conditional independence graph derived from the fitted graphical loglinear model can be used to detect discrimination patterns based on Markov properties. We further develop an algorithm to remove discrimination: the idea is to modify the significant coefficients of the fitted loglinear model and to use the modified model to generate new data. Our empirical evaluation shows the effectiveness of the proposed approach.
{"title":"Using Loglinear Model for Discrimination Discovery and Prevention","authors":"Yongkai Wu, Xintao Wu","doi":"10.1109/DSAA.2016.18","DOIUrl":"https://doi.org/10.1109/DSAA.2016.18","url":null,"abstract":"Discrimination discovery and prevention has received intensive attention recently. Discrimination generally refers to an unjustified distinction of individuals based on their membership, or perceived membership, in a certain group, and often occurs when the group is treated less favorably than others. However, existing discrimination discovery and prevention approaches are often limited to examining the relationship between one decision attribute and one protected attribute and do not sufficiently incorporate the effects due to other non-protected attributes. In this paper we develop a single unifying framework that aims to capture and measure discriminations between multiple decision attributes and protected attributes in addition to a set of non-protected attributes. Our approach is based on loglinear modeling. The coefficient values of the fitted loglinear model provide quantitative evidence of discrimination in decision making. The conditional independence graph derived from the fitted graphical loglinear model can be effectively used to capture the existence of discrimination patterns based on Markov properties. We further develop an algorithm to remove discrimination. The idea is modifying those significant coefficients from the fitted loglinear model and using the modified model to generate new data. Our empirical evaluation results show effectiveness of our proposed approach.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131884127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
BITCOIN is a novel decentralized cryptocurrency system that has recently received a great deal of attention from a wide audience. An interesting and unique feature of this system is that the complete list of all transactions that have occurred since its inception is publicly available. This makes it possible to investigate the movement of funds and uncover interesting properties of the BITCOIN economy. In this paper we present a set of analyses of the user graph, i.e. the graph obtained by a heuristic clustering of the graph of BITCOIN transactions. Our analyses consider an up-to-date BITCOIN blockchain, as of December 2015, after the exponential growth in the number of transactions over the last two years. The set of analyses we define includes, among others, the analysis of the time evolution of the BITCOIN network, the verification of the "rich get richer" conjecture, and the detection of the nodes that are critical for network connectivity.
{"title":"Uncovering the Bitcoin Blockchain: An Analysis of the Full Users Graph","authors":"D. Maesa, Andrea Marino, L. Ricci","doi":"10.1109/DSAA.2016.52","DOIUrl":"https://doi.org/10.1109/DSAA.2016.52","url":null,"abstract":"BITCOIN is a novel decentralized cryptocurrency system which has recently received a great attention from a wider audience. An interesting and unique feature of this system is that the complete list of all the transactions occurred from its inception is publicly available. This enables the investigation of funds movements to uncover interesting properties of the BITCOIN economy. In this paper we present a set of analyses of the user graph, i.e. the graph obtained by an heuristic clustering of the graph of BITCOIN transactions. Our analyses consider an up-to-date BITCOIN blockchain, as in December 2015, after the exponential explosion of the number of transactions occurred in the last two years. The set of analyses we defined includes, among others, the analysis of the time evolution of BITCOIN network, the verification of the \"rich get richer\" conjecture and the detection of the nodes which are critical for the network connectivity.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125448259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social networks enable knowledge sharing, which inevitably raises the question of expertise analysis. Many online profiles claim expertise, but true expertise is rare. We characterize expertise as projected expertise (what a person claims), perceived expertise (how the crowd perceives the individual), and true expertise (factual). StockTwits, an investor-focused microblogging platform, allows us to study all three aspects simultaneously. We analyze more than 18 million tweets spanning 1700 days. The large time scale also allows us to analyze expertise and its categories as they evolve over time, which is the first study of its kind on StockTwits. We propose a method to capture perceived expertise through how significantly a user's follower network grows and how often the user is brought up in conversations. We also quantify actual, market-based, true expertise based on the user's trade and investment recommendations. Finally, we provide an analysis of the differences between how users project themselves, how the crowd perceives them, and how they actually perform on the market. Our results show that users who project themselves as experts are the ones who talk the most and have the lowest recommendation-to-tweet ratio (that is, most of their conversations are mundane). The recommendations of users who project novice expertise slightly outperform (≈5%) the overall stock market. On the other hand, the trade recommendations of self-proclaimed experts yield 80% less than those of intermediate traders. Interestingly, users who are perceived as experts by others, as measured by network centrality, produced net negative returns over a four-year trading period. Our study also looks at the evolution of expertise and begins to examine why and what makes users change the way they project their own expertise. On this topic, however, the paper raises more questions than it answers, which will serve as the basis for future studies.
{"title":"Perceived, Projected, and True Investment Expertise: Not All Experts Provide Expert Recommendations","authors":"Amit Shavit, Sameena Shah","doi":"10.1109/DSAA.2016.45","DOIUrl":"https://doi.org/10.1109/DSAA.2016.45","url":null,"abstract":"Social networks enable knowledge sharing that inevitably begs the question of expertise analysis. Many online profiles claim expertise, but possessing true expertise is rare. We characterize expertise as projected expertise (claims of a person), perceived expertise (how the crowd perceives the individual) and true expertise (factual). StockTwits, an investor-focused microblogging platform, allows us to study all three aspects simultaneously. We analyze more than 18 million tweets spanning 1700 days. The large time scale allows us to also analyze expertise and its categories as they evolve over time, which is the first study of its kind on StockTwits. We propose a method to capture perceived expertise by how significantly a user's follower network grows and how often the user is brought up in conversations. We also quantify actual, market-based, true expertise based on the user's trade and investment recommendations. Finally we provide an analysis bringing out the differences between how users project themselves, how the crowd perceives them, and how they are actually performing on the market. Our results show that users who project themselves as experts are ones that talk the most and provide the least recommendation-to-tweet ratio (that is, most of their conversations are mundane). The recommendations from users who project novice expertise slightly outperform (≈5%) the overall stock market. On the other hand, the trade recommendations from self-proclaimed experts yield 80% less than those of intermediate traders. Interestingly, users who are perceived as experts by others, as measured by centrality measurements, resulted in net negative returns after a four year trading period. Our study also looks at the evolution of expertise, and begins to understand why and what makes users change the way they project their own expertise. For this topic, however, this paper introduces more questions than it answers, which will serve as the basis for future studies.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129895733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Crop production problems are common in Sri Lanka and severely affect rural farmers, the agriculture sector, and the country's economy as a whole. A deeper analysis revealed that the root cause is that farmers and other stakeholders in the domain do not receive the right information at the right time in the right format. Inspired by the rapid growth of mobile phone usage among farmers, we sought a mobile-based solution to overcome this information gap. Farmers need published (quasi-static) information about crops, pests, diseases, land preparation, and growing and harvesting methods, as well as real-time situational (dynamic) information such as current crop production and market prices. This situational information is also needed by the agriculture department, agro-chemical companies, buyers, and various government agencies to ensure food security through effective supply chain planning while minimising waste. We developed the notion of context-specific actionable information, which enables users to act with the least amount of further processing. A user-centered agriculture ontology was developed to convert published quasi-static information into actionable information. We adopted empowerment theory to create empowerment-oriented farming processes that motivate farmers to act on this information, and we aggregated the resulting transaction data to generate situational information. This created a holistic information flow model for the agriculture domain, similar to energy flow in biological ecosystems. Consequently, the initial mobile-based information system evolved into a Digital Knowledge Ecosystem that can predict the current production situation in near real time, enabling government agencies to dynamically adjust the incentives offered to farmers for growing different types of crops and thereby achieve sustainable agriculture production through crop diversification.
{"title":"Digital Knowledge Ecosystem for Achieving Sustainable Agriculture Production: A Case Study from Sri Lanka","authors":"A. Ginige, A. Walisadeera, T. Ginige, Lasanthi N. C. De Silva, P. D. Giovanni, M. Mathai, J. Goonetillake, G. Wikramanayake, G. Vitiello, M. Sebillo, G. Tortora, Deborah Richards, R. Jain","doi":"10.1109/DSAA.2016.82","DOIUrl":"https://doi.org/10.1109/DSAA.2016.82","url":null,"abstract":"Crop production problems are common in Sri Lanka which severely effect rural farmers, agriculture sector and the country's economy as a whole. A deeper analysis revealed that the root cause was farmers and other stakeholders in the domain not receiving right information at the right time in the right format. Inspired by the rapid growth of mobile phone usage among farmers a mobile-based solution is sought to overcome this information gap. Farmers needed published information (quasi static) about crops, pests, diseases, land preparation, growing and harvesting methods and real-time situational information (dynamic) such as current crop production and market prices. This situational information is also needed by agriculture department, agro-chemical companies, buyers and various government agencies to ensure food security through effective supply chain planning whilst minimising waste. We developed a notion of context specific actionable information which enables user to act with least amount of further processing. User centered agriculture ontology was developed to convert published quasi static information to actionable information. We adopted empowerment theory to create empowerment-oriented farming processes to motivate farmers to act on this information and aggregated the transaction data to generate situational information. This created a holistic information flow model for agriculture domain similar to energy flow in biological ecosystems. Consequently, the initial Mobile-based Information System evolved into a Digital Knowledge Ecosystem that can predict current production situation in near real enabling government agencies to dynamically adjust the incentives offered to farmers for growing different types of crops to achieve sustainable agriculture production through crop diversification.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126965436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing data mining techniques, particularly iterative learning algorithms, become overwhelmed by big data. While parallelism is an obvious and usually necessary strategy, we observe that (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms such as the Expectation Maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a non-linear hierarchical data structure (a heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow the iteration toward the data that is more difficult to cluster. We call this extended EM-T algorithm EM*. We show that EM* outperforms EM-T on large real-world and synthetic data sets. We conclude with some theoretical underpinnings that explain why EM* is successful.
{"title":"EM*: An EM Algorithm for Big Data","authors":"H. Kurban, Mark Jenne, Mehmet M. Dalkilic","doi":"10.1109/DSAA.2016.40","DOIUrl":"https://doi.org/10.1109/DSAA.2016.40","url":null,"abstract":"Existing data mining techniques, more particularly iterative learning algorithms, become overwhelmed with big data. While parallelism is an obvious and, usually, necessary strategy, we observe that both (1) continually revisiting data and (2) visiting all data are two of the most prominent problems especially for iterative, unsupervised algorithms like Expectation Maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a non-linear hierarchical data structure(heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow the iteration toward the data that is more difficult to cluster. We call this extended EM-T, EM*. We show our EM* algorithm outperform EM-T algorithm over large real world and synthetic data sets. We lastly conclude with some theoretic underpinnings that explain why EM* is successful.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132410734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Actitracker is a smartphone-based activity-monitoring service that helps people ensure they get enough physical activity to maintain good health. This free service allowed people to set personal activity goals and monitor their progress toward those goals. Actitracker uses machine learning methods to recognize a user's activities. It initially employs a "universal" model generated from labeled activity data collected from a panel of users, but it automatically shifts to a much more accurate personalized model once a user completes a simple training phase. Detailed activity reports and statistics are maintained and provided to the user. Actitracker is a research-based system that began in 2011, before fitness trackers like Fitbit were popular, and it was deployed for public use from 2012 until 2015, during which period it had 1,000 registered users. This paper describes the Actitracker system, its use of machine learning, and user experiences. While activity recognition has now entered the mainstream, this paper provides insights into applied activity recognition that commercial companies rarely share.
{"title":"Actitracker: A Smartphone-Based Activity Recognition System for Improving Health and Well-Being","authors":"Gary M. Weiss, J. W. Lockhart, T. Pulickal, Paul T. McHugh, Isaac H. Ronan, Jessica L. Timko","doi":"10.1109/DSAA.2016.89","DOIUrl":"https://doi.org/10.1109/DSAA.2016.89","url":null,"abstract":"Actitracker is a smartphone-based activity-monitoring service to help people ensure they receive sufficient activity to maintain proper health. This free service allowed people to set personal activity goals and monitor their progress toward these goals. Actitracker uses machine learning methods to recognize a user's activities. It initially employs a \"universal\" model generated from labeled activity data from a panel of users, but will automatically shift to a much more accurate personalized model once a user completes a simple training phase. Detailed activity reports and statistics are maintained and provided to the user. Actitracker is a research-based system that began in 2011, before fitness trackers like Fitbit were popular, and was deployed for public use from 2012 until 2015, during which period it had 1,000 registered users. This paper describes the Actitracker system, its use of machine learning, and user experiences. While activity recognition has now entered the mainstream, this paper provides insights into applied activity recognition, something that commercial companies rarely share.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132830935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online experiments are frequently used at internet companies to evaluate the impact of new designs, features, or code changes on user behavior. Though the experiment design is straightforward in theory, in practice many problems can complicate the interpretation of results and render conclusions about changes in user behavior invalid. Many of these problems are difficult to detect and often go unnoticed. Acknowledging and diagnosing these issues can prevent experiment owners from making decisions based on fundamentally flawed data. When conducting online experiments, data quality assurance is a top priority before any impact is attributed to changes in user behavior. While some problems can be detected by running AA tests before introducing the treatment, many do not emerge during the AA period and appear only during the AB period. Prior work on this topic has not addressed troubleshooting during the AB period. In this paper, we present lessons learned from experiments on various internet consumer products at Yahoo, along with diagnostic and remedy procedures. Most of the examples and troubleshooting procedures presented here apply to online experimentation at other companies as well. Some, such as traffic-splitting problems and outlier problems, have been documented before, but others have not previously been described in the literature.
{"title":"Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation","authors":"Zhenyu Zhao, Miao Chen, Don Matheson, M. Stone","doi":"10.1109/DSAA.2016.61","DOIUrl":"https://doi.org/10.1109/DSAA.2016.61","url":null,"abstract":"Online experiments are frequently used at internet companies to evaluate the impact of new designs, features, or code changes on user behavior. Though the experiment design is straightforward in theory, in practice, there are many problems that can complicate the interpretation of results and render any conclusions about changes in user behavior invalid. Many of these problems are difficult to detect and often go unnoticed. Acknowledging and diagnosing these issues can prevent experiment owners from making decisions based on fundamentally flawed data. When conducting online experiments, data quality assurance is a top priority before attributing the impact to changes in user behavior. While some problems can be detected by running AA tests before introducing the treatment, many problems do not emerge during the AA period, and appear only during the AB period. Prior work on this topic has not addressed troubleshooting during the AB period. In this paper, we present lessons learned from experiments on various internet consumer products at Yahoo, as well as diagnostic and remedy procedures. Most of the examples and troubleshooting procedures presented here are generic to online experimentation at other companies. Some, such as traffic splitting problems and outlier problems have been documented before, but others have not previously been described in the literature.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133356084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}