In this paper, the question of interest is estimating the true demand of a product at a given store location and time period in the retail environment, based on a single noisy and potentially censored observation. To address this question, we introduce a non-parametric framework for making inference from multiple time series. Somewhat surprisingly, we establish that an algorithm introduced for the purpose of "matrix completion" can be used to solve the relevant inference problem. Specifically, using the Universal Singular Value Thresholding (USVT) algorithm [2], we show that our estimator is consistent: the average mean squared error of the estimated average demand with respect to the true average demand goes to 0 as the number of store locations and time intervals increases to infinity. We establish naturally appealing properties of the resulting estimator both analytically and through a sequence of instructive simulations. Using a real retail dataset (Walmart), we argue for the practical relevance of our approach.
{"title":"Censored Demand Estimation in Retail","authors":"M. Amjad, D. Shah","doi":"10.1145/3219617.3219624","DOIUrl":"https://doi.org/10.1145/3219617.3219624","url":null,"abstract":"In this paper, the question of interest is estimating true demand of a product at a given store location and time period in the retail environment based on a single noisy and potentially censored observation. To address this question, we introduce a non-parametric framework to make inference from multiple time series. Somewhat surprisingly, we establish that the algorithm introduced for the purpose of \"matrix completion\" can be used to solve the relevant inference problem. Specifically, using the Universal Singular Value Thresholding (USVT) algorithm [2], we show that our estimator is consistent: the average mean squared error of the estimated average demand with respect to the true average demand goes to 0 as the number of store locations and time intervals increase to ınfty. We establish naturally appealing properties of the resulting estimator both analytically as well as through a sequence of instructive simulations. Using a real dataset in retail (Walmart), we argue for the practical relevance of our approach.","PeriodicalId":210440,"journal":{"name":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124612327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The QoS buffer management problem, which has significant and diverse applications, e.g., in online cloud resource allocation, is a classic online admission control problem in the presence of resource constraints. In its basic setting, packets with different values arrive in an online fashion at a switching node with a limited buffer size. The switch must then make an immediate decision to either admit or reject the incoming packet, based on the packet's value and the buffer availability. The objective is to maximize the cumulative profit of the admitted packets while respecting the buffer constraint. Even though the QoS buffer management problem was proposed more than a decade ago, no optimal online solution has appeared in the literature. This paper proposes an optimal randomized online algorithm for this problem.
{"title":"An Optimal Randomized Online Algorithm for QoS Buffer Management","authors":"L. Yang, W. Wong, M. Hajiesmaili","doi":"10.1145/3219617.3219629","DOIUrl":"https://doi.org/10.1145/3219617.3219629","url":null,"abstract":"The QoS buffer management problem, with significant and diverse computer applications, e.g., in online cloud resource allocation problems, is a classic online admission control problem in the presence of resource constraints. In its basic setting, packets with different values, arrive in online fashion to a switching node with limited buffer size. Then, the switch needs to make an immediate decision to either admit or reject the incoming packet based on the value of the packet and its buffer availability. The objective is to maximize the cumulative profit of the admitted packets, while respecting the buffer constraint. Even though the QoS buffer management problem was proposed more than a decade ago, no optimal online solution has been proposed in the literature. This paper proposes an optimal randomized online algorithm for this problem.","PeriodicalId":210440,"journal":{"name":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","volume":"68 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120995408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tomographic techniques can be used for the fast detection of link failures at low cost. Our paper studies the impact of the routing model on tomographic node placement costs. We present a taxonomy of path routing models and provide optimal and near-optimal algorithms to deploy a minimal number of asymmetric and symmetric tomography nodes for basic network topologies under different routing model classes. Intriguingly, we find that in many cases routing according to a more restrictive routing model gives better results: compared to a more general routing model, computing a good placement is algorithmically more tractable and does not entail high monitoring costs, a desirable trade-off in practice.
{"title":"Tomographic Node Placement Strategies and the Impact of the Routing Model","authors":"Y. Pignolet, S. Schmid, Gilles Trédan","doi":"10.1145/3219617.3219648","DOIUrl":"https://doi.org/10.1145/3219617.3219648","url":null,"abstract":"Tomographic techniques can be used for the fast detection of link failures at low cost. Our paper studies the impact of the routing model on tomographic node placement costs. We present a taxonomy of path routing models and provide optimal and near-optimal algorithms to deploy a minimal number of asymmetric and symmetric tomography nodes for basic network topologies under different routing model classes. Intriguingly, we find that in many cases routing according to a more restrictive routing model gives better results: compared to a more general routing model, computing a good placement is algorithmically more tractable and does not entail high monitoring costs, a desirable trade-off in practice.","PeriodicalId":210440,"journal":{"name":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115132234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Backhaul transport network design and optimization for cellular service providers involve a unique challenge, stemming from the fact that an end user's equipment (UE) is within the radio reach of multiple cellular towers: it is hard to evaluate the impact on a UE of the failure of its primary serving tower, because the UE may simply switch to receive service from other nearby towers. To overcome this challenge, one needs to quantify the cellular service redundancy among the towers riding on a given transport circuit and their nearby towers, which in turn requires a comprehensive understanding of the radio signal profile in the area of the impacted towers, the spatial distribution of UEs therein, and their expected workload (e.g., calls, data throughput). In this work, we develop a novel methodology for assessing the service impact of any hypothetical cellular tower outage scenario and implement it in an operational system named Tower Outage Impact Predictor (TOIP). Our evaluations, using both synthetic data and historical real tower outages in a large operational cellular network, show conclusively that TOIP gives an accurate assessment of various tower outage scenarios and can provide critical input data toward designing a reliable cellular backhaul transport network.
{"title":"Predictive Impact Analysis for Designing a Resilient Cellular Backhaul Network","authors":"Sen Yang, He Yan, Zihui Ge, Dongmei Wang, Jun Xu","doi":"10.1145/3219617.3219651","DOIUrl":"https://doi.org/10.1145/3219617.3219651","url":null,"abstract":"Backhaul transport network design and optimization for cellular service providers involve a unique challenge stemming from the fact that an end-user's equipment (UE) is within the radio reach of multiple cellular towers: It is hard to evaluate the impact of the failure of the UE's primary serving tower on the UE, because the UE may simply switch to get service from other nearby cellular towers. To overcome this challenge, one needs to quantify the cellular service redundancy among the cellular towers riding on that transport circuit and their nearby cellular towers, which in turn requires a comprehensive understanding of the radio signal profile in the area of the impacted towers, the spatial distribution of UEs therein, and their expected workload (e.g., calls, data throughput). In this work, we develop a novel methodology for assessing the service impact of any hypothetical cellular tower outage scenario, and implement it in an operational system named Tower Outage Impact Predictor (TOIP). Our evaluations, using both synthetic data and historical real tower outages in a large operational cellular network, show conclusively that TOIP gives an accurate assessment of various tower outage scenarios, and can provide critical input data towards designing a reliable cellular backhaul transport network.","PeriodicalId":210440,"journal":{"name":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","volume":"254 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132826405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the distributed statistical learning problem over decentralized systems that are prone to adversarial attacks. This setup arises in many practical applications, including Google's Federated Learning. Formally, we focus on a decentralized system that consists of a parameter server and m working machines; each working machine keeps N/m data samples, where N is the total number of samples. In each iteration, up to q of the m working machines suffer Byzantine faults: a faulty machine in a given iteration behaves arbitrarily badly against the system and has complete knowledge of the system. Additionally, the sets of faulty machines may differ across iterations. Our goal is to design robust algorithms such that the system can learn the underlying true parameter, which is of dimension d, despite the interruption of the Byzantine attacks. In this paper, based on the geometric median of means of the gradients, we propose a simple variant of the classical gradient descent method. We show that our method can tolerate q Byzantine failures with 2(1 + ε)q ≤ m for an arbitrarily small but fixed constant ε > 0. The parameter estimate converges in O(log N) rounds with an estimation error on the order of max{√(dq/N), √(d/N)}, which is larger than the minimax-optimal error rate √(d/N) of the centralized and failure-free setting by at most a factor of √q. The total computational complexity of our algorithm is O((Nd/m) log N) at each working machine and O(md + kd log³ N) at the central server, and the total communication cost is O(md log N). We further provide an application of our general results to the linear regression problem. A key challenge in the above problem is that Byzantine failures create arbitrary and unspecified dependency among the iterations and the aggregated gradients. To handle this issue in the analysis, we prove that the aggregated gradient, as a function of the model parameter, converges uniformly to the true gradient function.
{"title":"Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent","authors":"Yudong Chen, Lili Su, Jiaming Xu","doi":"10.1145/3219617.3219655","DOIUrl":"https://doi.org/10.1145/3219617.3219655","url":null,"abstract":"We consider the distributed statistical learning problem over decentralized systems that are prone to adversarial attacks. This setup arises in many practical applications, including Google's Federated Learning. Formally, we focus on a decentralized system that consists of a parameter server and m working machines; each working machine keeps N/m data samples, where N is the total number of samples. In each iteration, up to q of the m working machines suffer Byzantine faults -- a faulty machine in the given iteration behaves arbitrarily badly against the system and has complete knowledge of the system. Additionally, the sets of faulty machines may be different across iterations. Our goal is to design robust algorithms such that the system can learn the underlying true parameter, which is of dimension d, despite the interruption of the Byzantine attacks. In this paper, based on the geometric median of means of the gradients, we propose a simple variant of the classical gradient descent method. We show that our method can tolerate q Byzantine failures up to 2(1+ε)q łe m for an arbitrarily small but fixed constant ε>0. The parameter estimate converges in O(łog N) rounds with an estimation error on the order of max √dq/N, ~√d/N , which is larger than the minimax-optimal error rate √d/N in the centralized and failure-free setting by at most a factor of √q . The total computational complexity of our algorithm is of O((Nd/m) log N) at each working machine and O(md + kd log 3 N) at the central server, and the total communication cost is of O(m d log N). We further provide an application of our general results to the linear regression problem. A key challenge arises in the above problem is that Byzantine failures create arbitrary and unspecified dependency among the iterations and the aggregated gradients. To handle this issue in the analysis, we prove that the aggregated gradient, as a function of model parameter, converges uniformly to the true gradient function.","PeriodicalId":210440,"journal":{"name":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124069615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce a new framework for the analysis of large-scale load balancing networks with general service time distributions, motivated by applications in server farms, distributed memory machines, cloud computing, and communication systems. For a parallel server network using the so-called SQ(d) load balancing routing policy, we use a novel representation of the state of the system and identify its fluid limit when the number of servers goes to infinity and the arrival rate per server tends to a constant. The fluid limit is characterized as the unique solution to a countable system of coupled partial differential equations (PDE), which serve to approximate transient quality-of-service parameters such as the expected virtual waiting time and the queue length distribution. In the special case when the service time distribution is exponential, our method recovers the well-known ordinary differential equation characterization of the fluid limit. Furthermore, we develop a numerical scheme to solve the PDE and demonstrate the efficacy of the PDE approximation by comparing it with Monte Carlo simulations. We also illustrate how the PDE can be used to gain insight into the performance of large networks in practical scenarios by analyzing relaxation times in a backlogged network. In particular, our numerical approximation of the PDE uncovers two interesting properties of relaxation times under the SQ(2) algorithm. First, when the service time distribution is Pareto with unit mean, the relaxation time decreases as the tail becomes heavier. This is a priori counterintuitive, given that for the Pareto distribution heavier tails have been shown to lead to worse tail behavior in equilibrium. Second, for unit-mean light-tailed service distributions such as the Weibull and lognormal, the relaxation time decreases as the variance increases. This is in contrast to the behavior observed under random routing, where the relaxation time increases with the variance.
{"title":"The PDE Method for the Analysis of Randomized Load Balancing Networks","authors":"Reza Aghajani, Xingjie Li, K. Ramanan","doi":"10.1145/3219617.3219672","DOIUrl":"https://doi.org/10.1145/3219617.3219672","url":null,"abstract":"We introduce a new framework for the analysis of large-scale load balancing networks with general service time distributions, motivated by applications in server farms, distributed memory machines, cloud computing and communication systems. For a parallel server network using the so-called $SQ(d)$ load balancing routing policy, we use a novel representation for the state of the system and identify its fluid limit, when the number of servers goes to infinity and the arrival rate per server tends to a constant. The fluid limit is characterized as the unique solution to a countable system of coupled partial differential equations (PDE), which serve to approximate transient Quality of Service parameters such as the expected virtual waiting time and queue length distribution. In the special case when the service time distribution is exponential, our method recovers the well-known ordinary differential equation characterization of the fluid limit. Furthermore, we develop a numerical scheme to solve the PDE, and demonstrate the efficacy of the PDE approximation by comparing it with Monte Carlo simulations. We also illustrate how the PDE can be used to gain insight into the performance of large networks in practical scenarios by analyzing relaxation times in a backlogged network. In particular, our numerical approximation of the PDE uncovers two interesting properties of relaxation times under the SQ(2) algorithm. Firstly, when the service time distribution is Pareto with unit mean, the relaxation time decreases as the tail becomes heavier. This is a priori counterintuitive given that for the Pareto distribution, heavier tails have been shown to lead to worse tail behavior in equilibrium. Secondly, for unit mean light-tailed service distributions such as the Weibull and lognormal, the relaxation time decreases as the variance increases. This is in contrast to the behavior observed under random routing, where the relaxation time increases with increase in variance.","PeriodicalId":210440,"journal":{"name":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","volume":"508 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131406889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Search engines answer users' queries by listing relevant items (e.g., documents, songs, products, web pages, ...). These engines rely on algorithms that learn to rank items so as to present an ordered list maximizing the probability that it contains a relevant item. The main challenge in the design of learning-to-rank algorithms stems from the fact that queries often have different meanings for different users. In the absence of any contextual information about the query, one often has to adhere to the diversity principle, i.e., to return a list covering the various possible topics or meanings of the query. To formalize this learning-to-rank problem, we propose a natural model where (i) items are categorized into topics, (ii) users find items relevant only if they match the topic of their query, and (iii) the engine is aware of neither the topic of an arriving query, nor the frequency at which queries related to various topics arrive, nor the topic-dependent click-through rates of the items. For this problem, we devise LDR (Learning Diverse Rankings), an algorithm that efficiently learns the optimal list based on users' feedback only. We show that after T queries, the regret of LDR scales as O((N - L) log(T)), where N is the total number of items and L is the length of the displayed list. This scaling cannot be improved, i.e., LDR is order-optimal.
{"title":"Online Learning of Optimally Diverse Rankings","authors":"Stefan Magureanu, A. Proutière, Marcus Isaksson, Boxun Zhang","doi":"10.1145/3219617.3219637","DOIUrl":"https://doi.org/10.1145/3219617.3219637","url":null,"abstract":"Search engines answer users' queries by listing relevant items (e.g. documents, songs, products, web pages, ...). These engines rely on algorithms that learn to rank items so as to present an ordered list maximizing the probability that it contains relevant item. The main challenge in the design of learning-to-rank algorithms stems from the fact that queries often have different meanings for different users. In absence of any contextual information about the query, one often has to adhere to the diversity principle, i.e., to return a list covering the various possible topics or meanings of the query. To formalize this learning-to-rank problem, we propose a natural model where (i) items are categorized into topics, (ii) users find items relevant only if they match the topic of their query, and (iii) the engine is not aware of the topic of an arriving query, nor of the frequency at which queries related to various topics arrive, nor of the topic-dependent click-through-rates of the items. For this problem, we devise LDR (Learning Diverse Rankings), an algorithm that efficiently learns the optimal list based on users' feedback only. We show that after T queries, the regret of LDR scales as O((N-L)log(T)) where N is the number of all items. This scaling cannot be improved, i.e., LDR is order optimal.","PeriodicalId":210440,"journal":{"name":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124552235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Motivated by applications in machine learning and statistics, we study distributed optimization problems over a network of processors, where the goal is to optimize a global objective composed of a sum of local functions. In these problems, due to the large scale of the data sets, the data and computation must be distributed over multiple processors, resulting in the need for distributed algorithms. In this paper, we consider a popular distributed gradient-based consensus algorithm, which requires only local computation and communication. An important problem in this area is to analyze the convergence rate of such algorithms in the presence of the communication delays that are inevitable in distributed systems. We prove the convergence of the gradient-based consensus algorithm in the presence of uniform, but possibly arbitrarily large, communication delays between the processors. Moreover, we obtain an upper bound on the rate of convergence of the algorithm as a function of the network size, the topology, and the inter-processor communication delays.
{"title":"On the Convergence Rate of Distributed Gradient Methods for Finite-Sum Optimization under Communication Delays","authors":"T. Doan, Carolyn L. Beck, R. Srikant","doi":"10.1145/3219617.3219654","DOIUrl":"https://doi.org/10.1145/3219617.3219654","url":null,"abstract":"Motivated by applications in machine learning and statistics, we study distributed optimization problems over a network of processors, where the goal is to optimize a global objective composed of a sum of local functions. In these problems, due to the large scale of the data sets, the data and computation must be distributed over multiple processors resulting in the need for distributed algorithms. In this paper, we consider a popular distributed gradient-based consensus algorithm, which only requires local computation and communication. An important problem in this area is to analyze the convergence rate of such algorithms in the presence of communication delays that are inevitable in distributed systems. We prove the convergence of the gradient-based consensus algorithm in the presence of uniform, but possibly arbitrarily large, communication delays between the processors. Moreover, we obtain an upper bound on the rate of convergence of the algorithm as a function of the network size, topology, and the inter-processor communication delays.","PeriodicalId":210440,"journal":{"name":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128867493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Schardl, Tyler Denniston, Damon Doucet, Bradley C. Kuszmaul, I. Lee, C. Leiserson
The CSI framework provides comprehensive static instrumentation that a compiler can insert into a program-under-test so that dynamic-analysis tools (memory checkers, race detectors, cache simulators, performance profilers, code-coverage analyzers, etc.) can observe and investigate runtime behavior. Heretofore, tools based on compiler instrumentation would each separately modify the compiler to insert their own instrumentation. In contrast, CSI inserts a standard collection of instrumentation hooks into the program-under-test. Each CSI tool is implemented as a library that defines the relevant hooks, while the remaining hooks are "nulled out" and elided during either compile-time or link-time optimization, resulting in instrumented runtimes on par with custom instrumentation. CSI allows many compiler-based tools to be written as simple libraries without modifying the compiler, lowering the bar for the development of dynamic-analysis tools. We have defined a standard API for CSI and modified LLVM to insert CSI hooks into the compiler's internal representation (IR) of the program. The API organizes IR objects (such as functions, basic blocks, and memory accesses) into flat and compact ID spaces, which not only simplifies the building of tools but, surprisingly, enables faster maintenance of IR-object data than do traditional hash tables. CSI hooks contain a "property" parameter that allows tools to customize behavior based on static information without introducing overhead. CSI provides "forensic" tables that tools can use to associate IR objects with source-code locations and to relate IR objects to each other. To evaluate the efficacy of CSI, we implemented six demonstration CSI tools. One of our studies shows that compiling with CSI and linking with the "null" CSI tool produces a tool-instrumented executable that is as fast as the original uninstrumented code. Another study, using a CSI port of Google's ThreadSanitizer, shows that the CSI tool rivals the performance of Google's custom compiler-based implementation. All other demonstration CSI tools slow down the execution of the program-under-test by less than 70%.
{"title":"The CSI Framework for Compiler-Inserted Program Instrumentation","authors":"T. Schardl, Tyler Denniston, Damon Doucet, Bradley C. Kuszmaul, I. Lee, C. Leiserson","doi":"10.1145/3219617.3219657","DOIUrl":"https://doi.org/10.1145/3219617.3219657","url":null,"abstract":"The CSI framework provides comprehensive static instrumentation that a compiler can insert into a program-under-test so that dynamic-analysis tools - memory checkers, race detectors, cache simulators, performance profilers, code-coverage analyzers, etc. - can observe and investigate runtime behavior. Heretofore, tools based on compiler instrumentation would each separately modify the compiler to insert their own instrumentation. In contrast, CSI inserts a standard collection of instrumentation hooks into the program-under-test. Each CSI-tool is implemented as a library that defines relevant hooks, and the remaining hooks are \"nulled\" out and elided during either compile-time or link-time optimization, resulting in instrumented runtimes on par with custom instrumentation. CSI allows many compiler-based tools to be written as simple libraries without modifying the compiler, lowering the bar for the development of dynamic-analysis tools. We have defined a standard API for CSI and modified LLVM to insert CSI hooks into the compiler's internal representation (IR) of the program. The API organizes IR objects - such as functions, basic blocks, and memory accesses - into flat and compact ID spaces, which not only simplifies the building of tools, but surprisingly enables faster maintenance of IR-object data than do traditional hash tables. CSI hooks contain a \"property\" parameter that allows tools to customize behavior based on static information without introducing overhead. CSI provides \"forensic\" tables that tools can use to associate IR objects with source-code locations and to relate IR objects to each other. To evaluate the efficacy of CSI, we implemented six demonstration CSI-tools. One of our studies shows that compiling with CSI and linking with the \"null\" CSI-tool produces a tool-instrumented executable that is as fast as the original uninstrumented code. Another study, using a CSI port of Google's ThreadSanitizer, shows that the CSI-tool rivals the performance of Google's custom compiler-based implementation. All other demonstration CSI tools slow down the execution of the program-under-test by less than 70%.","PeriodicalId":210440,"journal":{"name":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129285754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We establish a unified analytical framework for designing load balancing algorithms that can simultaneously achieve low latency, low complexity, and low communication overhead. We first propose a general class Π of load balancing policies and prove that they are both throughput optimal and heavy-traffic delay optimal. This class Π includes popular policies such as join-shortest-queue (JSQ) and power-of-d as special cases, but not the recently proposed join-idle-queue (JIQ) policy. In fact, we show that JIQ is not heavy-traffic delay optimal even for homogeneous servers. By exploiting the flexibility offered by the class Π, we design a new load balancing policy called join-below-threshold (JBT-d), in which arriving jobs are preferentially assigned to queues whose length is no greater than a threshold, and the threshold is updated infrequently. JBT-d has several benefits: (i) JBT-d belongs to the class Π and hence is throughput optimal and heavy-traffic delay optimal. (ii) JBT-d has zero dispatching delay, like JIQ and other pull-based policies, and low message overhead due to infrequent threshold updates. (iii) Extensive simulations show that JBT-d has good delay performance, comparable to the JSQ policy in various system settings.
{"title":"Designing Low-Complexity Heavy-Traffic Delay-Optimal Load Balancing Schemes: Theory to Algorithms","authors":"Xingyu Zhou, Fei Wu, Jian Tan, Yin Sun, N. Shroff","doi":"10.1145/3219617.3219670","DOIUrl":"https://doi.org/10.1145/3219617.3219670","url":null,"abstract":"We establish a unified analytical framework for designing load balancing algorithms that can simultaneously achieve low latency, low complexity, and low communication overhead. We first propose a general class ¶ of load balancing policies and prove that they are both throughput optimal and heavy-traffic delay optimal. This class ¶ includes popular policies such as join-shortest-queue (JSQ) and power-of- d as special cases, but not the recently proposed join-idle-queue (JIQ) policy. In fact, we show that JIQ is not heavy-traffic delay optimal even for homogeneous servers. By exploiting the flexibility offered by the class ¶, we design a new load balancing policy called join-below-threshold (JBT-d), in which the arrival jobs are preferentially assigned to queues that are no greater than a threshold, and the threshold is updated infrequently. JBT-d has several benefits: (i) JBT-d belongs to the class ¶i and hence is throughput optimal and heavy-traffic delay optimal. (ii) JBT-d has zero dispatching delay, like JIQ and other pull-based policies, and low message overhead due to infrequent threshold updates. (iii) Extensive simulations show that JBT-d has good delay performance, comparable to the JSQ policy in various system settings.","PeriodicalId":210440,"journal":{"name":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133633637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}