This paper presents the design of a Coherence-Free Processor (CFP) that enables a scalable multiprocessor by eliminating cache coherence operations in both hardware and software. The CFP uses a coherence-free cache (CFC) that can improve the cost-effectiveness and performance of existing multiprocessors for commonly used workloads. The CFC is feasible because not all program data residing in a multiprocessor cache need to be accessed by other processors, and private caches at level 1 (L1) and level 2 (L2) facilitate this form of sharing. Reentrant programs are specifically designed to protect their data from modification by other tasks, and program data that are modified but not shared with other tasks do not require a coherence protocol. Adding processors shortens the multitasking queue and thus reduces elapsed time, because simultaneous execution replaces concurrent execution.
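The core observation, that coherence is only needed for data written by one processor and visible to another, can be illustrated with a small sketch. The trace format and classification rule below are hypothetical and only model the software-level intuition, not the CFP hardware design.

    # Illustrative sketch (not the CFP hardware itself): classify memory blocks by
    # sharing behaviour to show why private, task-local data needs no coherence.
    from collections import defaultdict

    # (processor_id, block_address, is_write) access trace -- hypothetical example data.
    trace = [
        (0, 0x100, True), (0, 0x100, False),   # block written and read only by CPU 0
        (1, 0x200, True), (1, 0x200, False),   # block private to CPU 1
        (0, 0x300, False), (1, 0x300, False),  # read-only data shared by both CPUs
        (0, 0x400, True), (1, 0x400, False),   # written by CPU 0, read by CPU 1
    ]

    readers, writers = defaultdict(set), defaultdict(set)
    for cpu, block, is_write in trace:
        (writers if is_write else readers)[block].add(cpu)

    for block in sorted(set(readers) | set(writers)):
        sharers = readers[block] | writers[block]
        # Coherence is only needed when a written block is visible to another processor.
        needs_coherence = bool(writers[block]) and len(sharers) > 1
        print(hex(block), "needs coherence" if needs_coherence else "coherence-free")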
Infrastructure-as-a-Service (IaaS) cloud platforms offer resources under diverse purchasing options. Users can run an instance on the on-demand market, which is stable but expensive, or on the spot market at a significant discount. However, users have to weigh the low cost of spot instances carefully against their poor availability, since spot instances are revoked whenever a revocation event occurs. Thus, an important problem an IaaS user now faces is how to use spot instances in a cost-effective and low-risk way. Based on a replication-based fault-tolerance mechanism, we propose an online termination algorithm that optimizes the cost of using spot instances while ensuring operational stability. We prove that in most cases the cost of our online algorithm does not exceed twice the minimum cost of the optimal offline algorithm that knows the exact future a priori. Extensive experiments verify that our algorithm achieves a competitive ratio of no more than 2 in most cases and reaches the guaranteed competitive ratio in the remaining ones.
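The 2-competitive guarantee is of the same flavour as the classic ski-rental break-even rule: keep paying a small recurring cost until the total would exceed a one-off cost, then commit. The sketch below shows only that generic rule with assumed prices; it is an illustrative analogy, not the paper's actual spot-termination algorithm.

    def ski_rental_online(days_needed, rent=1.0, buy=10.0):
        """Online rule: rent until cumulative rent reaches the purchase price, then buy."""
        paid = 0.0
        for _ in range(days_needed):
            if paid + rent >= buy:
                return paid + buy          # commit to the one-off cost
            paid += rent                   # keep paying the recurring cost
        return paid

    def ski_rental_offline(days_needed, rent=1.0, buy=10.0):
        """Offline optimum that knows days_needed in advance."""
        return min(days_needed * rent, buy)

    for d in (3, 9, 10, 50):
        print(d, ski_rental_online(d) / ski_rental_offline(d))  # ratio never exceeds 2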
In recent years, live streaming has become a popular application, and it has traditionally used TCP as its primary transport protocol. The Quick UDP Internet Connections (QUIC) protocol opens up new opportunities for live streaming; however, how to leverage QUIC to transmit live video has not yet been studied. This paper first investigates the achievable quality of experience (QoE) of streaming live videos over TCP, QUIC, and their multipath extensions, Multipath TCP (MPTCP) and Multipath QUIC (MPQUIC). We observe that MPQUIC achieves the best performance thanks to bandwidth aggregation and transmission reliability. However, network fluctuations may cause path heterogeneity, high path loss, and bandwidth degradation, resulting in significant QoE deterioration. Motivated by these observations, we investigate the multipath packet scheduling problem in live streaming and design 4D-MAP, a multipath adaptive packet scheduling scheme over QUIC. Specifically, we propose a linear upper confidence bound (LinUCB)-based online learning algorithm, along with four novel scheduling mechanisms, i.e., Dispatch, Duplicate, Discard, and Decompensate, to conquer the above problems. 4D-MAP has been evaluated in both controlled emulation and real-world networks against state-of-the-art multipath transmission schemes. Experimental results reveal that 4D-MAP outperforms the others in improving the QoE of live streaming.
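A minimal LinUCB arm-selection sketch in the spirit of the scheduler described above: each network path is an arm, its context vector holds observed features (e.g., normalized RTT and loss rate), and the path with the highest upper confidence bound gets the next packet. The feature choice and reward definition here are illustrative assumptions, not 4D-MAP's exact design.

    import numpy as np

    class LinUCBPathScheduler:
        def __init__(self, n_paths, n_features, alpha=1.0):
            self.alpha = alpha
            self.A = [np.eye(n_features) for _ in range(n_paths)]      # per-arm covariance
            self.b = [np.zeros(n_features) for _ in range(n_paths)]    # per-arm reward vector

        def select(self, contexts):
            """contexts: one feature vector per path; returns the index of the chosen path."""
            scores = []
            for A, b, x in zip(self.A, self.b, contexts):
                A_inv = np.linalg.inv(A)
                theta = A_inv @ b
                ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)  # mean + exploration bonus
                scores.append(ucb)
            return int(np.argmax(scores))

        def update(self, path, context, reward):
            """Feed back the observed reward (e.g., negative delivery delay) for the chosen path."""
            self.A[path] += np.outer(context, context)
            self.b[path] += reward * context

    # Example: two paths described by [normalized RTT, loss rate].
    sched = LinUCBPathScheduler(n_paths=2, n_features=2)
    path = sched.select([np.array([0.2, 0.01]), np.array([0.8, 0.05])])
    sched.update(path, np.array([0.2, 0.01]), reward=1.0)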
Image bitmaps, i.e., data containing the pixels of a visual image, are widely used in emerging applications for pixel operations while consuming large amounts of memory space and energy. Compared with legacy DRAM (dynamic random access memory), non-volatile memories (NVMs) are suitable for bitmap storage due to their high density and intrinsic durability. However, NVM writes suffer from higher energy consumption and latency than reads. Existing precise or approximate compression schemes in NVM controllers show limited performance for bitmaps due to the irregular data patterns and variance in bitmaps. We observe pixel-level similarity when writing bitmaps, owing to the analogous contents of adjacent pixels. By exploiting this pixel-level similarity, we propose SimCom, an approximate similarity-aware compression scheme in the NVM module controller, to efficiently compress data for each write access on the fly. The idea behind SimCom is to compress runs of similar words into pairs of a base word and a run length. The storage cost of small runs is further mitigated by reusing the least significant bits of the base words. SimCom adaptively selects an appropriate compression mode for various bitmap formats, thus achieving an efficient trade-off between quality and memory performance. We implement SimCom on GEM5/zsim with NVMain and evaluate its performance with real-world image and video workloads. Our results demonstrate the efficacy and efficiency of SimCom and its favorable quality-performance trade-off.
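A simplified software model of the base-plus-run idea may help: scan the words of a write, and whenever consecutive words are "similar" to the current base word (here: equal in their upper bits), extend a run instead of storing them. The word size, similarity mask, and encoding below are illustrative assumptions, and the paper's LSB-reuse optimization for small runs is omitted.

    def compress_similar(words, mask=0xFFFFFF00):
        """Return a list of (base_word, run_length) pairs approximating the input."""
        pairs = []
        base, run = words[0], 1
        for w in words[1:]:
            if (w & mask) == (base & mask):   # approximately similar to the base word
                run += 1
            else:
                pairs.append((base, run))
                base, run = w, 1
        pairs.append((base, run))
        return pairs

    def decompress_similar(pairs):
        """Expand pairs back to words; low bits are approximated by the base word."""
        return [base for base, run in pairs for _ in range(run)]

    # Adjacent pixels are often near-identical, so runs are long and few pairs are stored.
    pixels = [0x11223344, 0x11223345, 0x11223347, 0x99887766, 0x99887766]
    print(compress_similar(pixels))   # [(0x11223344, 3), (0x99887766, 2)]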
Network embedding, as an approach to learning low-dimensional representations of nodes, has proved extremely useful in many applications, e.g., node classification and link prediction. Unfortunately, existing network embedding models are vulnerable to random or adversarial perturbations, which may degrade the performance of network embedding when it is applied to downstream tasks. To achieve robust network embedding, researchers have introduced adversarial training to regularize the embedding learning process by training on a mixture of adversarial and original examples. However, existing methods generate adversarial examples heuristically and fail to guarantee the imperceptibility of the generated examples, thus limiting the power of adversarial training. In this paper, we propose a novel method, Identity-Preserving Adversarial Training (IPAT), for network embedding, which generates imperceptible adversarial examples under an explicit identity-preserving regularization. We formalize this identity-preserving regularization as a multi-class classification problem in which each node represents a class, and we encourage each adversarial example to be classified as the class of its original node. Extensive experimental results on real-world datasets demonstrate that our proposed IPAT method significantly improves the robustness of network embedding models and the generalization of the learned node representations on various downstream tasks.
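A minimal sketch of such an identity-preserving regularizer: treat every node as its own class and penalize adversarial embeddings that a classifier cannot map back to their original node. The dimensions, the linear classifier, and the loss weight below are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    num_nodes, embed_dim = 100, 16
    node_classifier = nn.Linear(embed_dim, num_nodes)   # one class per node

    def identity_preserving_loss(adv_embeddings, node_ids):
        """Cross-entropy that encourages each adversarial embedding to be classified
        as the node it was perturbed from, keeping perturbations imperceptible."""
        logits = node_classifier(adv_embeddings)
        return nn.functional.cross_entropy(logits, node_ids)

    # Example: perturbed embeddings for nodes 0..31; the full training objective would
    # add this term, scaled by a hyper-parameter, to the usual adversarial training loss.
    adv = torch.randn(32, embed_dim)
    ids = torch.arange(32)
    print(identity_preserving_loss(adv, ids).item())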
Federated learning has emerged as a distributed learning paradigm in which each client trains locally and a parameter server aggregates the results. System heterogeneity prevents stragglers from responding to the server in time and incurs large communication costs. Although client grouping in federated learning can solve the straggler problem, the stochastic selection strategy used in client grouping neglects the data distribution within each group. Moreover, current client grouping approaches subject clients to unfair participation, leading to biased performance across clients. To guarantee fair client participation and mitigate biased local performance, we propose a federated dynamic client selection method based on data representativity (FedSDR). FedSDR clusters clients into groups according to their local computational efficiency. To estimate the significance of client datasets, we design a novel data representativity evaluation scheme based on the local data distribution. Furthermore, the two most representative clients in each group are selected to optimize the global model. Finally, the DYNAMIC-SELECT algorithm updates the local computational efficiency and data representativity states to regroup clients after each periodic average aggregation. Evaluations on real datasets show that FedSDR improves client participation by 27.4%, 37.9%, and 23.3% compared with FedAvg, TiFL, and FedSS, respectively, taking fairness into account in federated learning. In addition, FedSDR surpasses FedAvg, FedGS, and FedMS by 21.32%, 20.4%, and 6.90%, respectively, in local test accuracy variance, balancing the performance bias of the global model across clients.
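An illustrative sketch of the group-then-select step described above: clients are grouped by a computational-efficiency score, each client's representativity is scored by how closely its local label distribution matches the global one, and the two most representative clients per group are selected. The concrete metrics are assumptions; the DYNAMIC-SELECT state update is not reproduced here.

    import numpy as np

    def representativity(local_dist, global_dist):
        """Higher when the client's label distribution matches the global distribution."""
        return -np.abs(np.asarray(local_dist) - np.asarray(global_dist)).sum()

    def select_clients(clients, global_dist, n_groups=2, per_group=2):
        """clients: list of (client_id, efficiency, local_label_distribution)."""
        ordered = sorted(clients, key=lambda c: c[1])                 # group by efficiency
        groups = np.array_split(np.arange(len(ordered)), n_groups)
        selected = []
        for idx in groups:
            group = [ordered[i] for i in idx]
            group.sort(key=lambda c: representativity(c[2], global_dist), reverse=True)
            selected.extend(cid for cid, _, _ in group[:per_group])   # top-2 per group
        return selected

    global_dist = [0.5, 0.5]
    clients = [(0, 1.0, [0.9, 0.1]), (1, 1.1, [0.5, 0.5]), (2, 1.2, [0.6, 0.4]),
               (3, 5.0, [0.2, 0.8]), (4, 5.5, [0.5, 0.5]), (5, 6.0, [0.7, 0.3])]
    print(select_clients(clients, global_dist))   # e.g. [1, 2, 4, 5]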
Most distributed stream processing engines (DSPEs) do not support online task management and cannot adapt to time-varying data flows. Recently, some studies have proposed online task deployment algorithms to solve this problem. However, these approaches do not guarantee Quality of Service (QoS) when the task deployment changes at runtime, because the task migrations caused by a change of deployment impose an exorbitant cost. We study one of the most popular DSPEs, Apache Storm, and find that when a task needs to be migrated, Storm has to stop the resource (implemented as a Worker process in Storm) where the task is deployed. This stops and restarts all tasks in that resource, resulting in poor task-migration performance. To solve this problem, we propose N-Storm (Nonstop Storm), a task-resource decoupling DSPE. N-Storm allows the tasks allocated to resources to be changed at runtime, implemented through a thread-level scheme for task migration. In particular, we add a local shared key/value store on each node to make resources aware of changes in the allocation plan, so that each resource can manage its tasks at runtime. Based on N-Storm, we further propose Online Task Deployment (OTD). Unlike traditional task deployment algorithms that deploy all tasks at once without considering the cost of the task migrations caused by a re-deployment, OTD gradually adjusts the current task deployment toward an optimized one based on the communication cost and the runtime states of resources. We demonstrate that OTD can adapt to different kinds of applications, including computation- and communication-intensive applications. Experimental results on a real DSPE cluster show that N-Storm can avoid system stops and save up to 87% of the performance degradation time compared with Apache Storm and other state-of-the-art approaches. In addition, OTD can increase the average CPU usage by 51% for computation-intensive applications and reduce network communication costs by 88% for communication-intensive applications.
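A toy sketch of the task-resource decoupling idea: a Worker polls a local shared key/value store for its current task assignment and starts or stops task threads accordingly, instead of being killed and restarted whenever the allocation plan changes. The store, polling logic, and task bodies below are illustrative assumptions, not N-Storm's actual implementation.

    import threading, time

    allocation_store = {"worker-1": {"taskA", "taskB"}}   # stands in for the local KV store
    running = {}                                          # task name -> stop flag

    def run_task(name, stop_flag):
        while not stop_flag.is_set():
            time.sleep(0.1)                               # placeholder for real tuple processing

    def reconcile(worker_id):
        """Start newly assigned tasks and stop removed ones at thread level."""
        assigned = allocation_store.get(worker_id, set())
        for name in assigned - set(running):
            flag = threading.Event()
            running[name] = flag
            threading.Thread(target=run_task, args=(name, flag), daemon=True).start()
        for name in set(running) - assigned:
            running.pop(name).set()                       # signal the task thread to stop

    reconcile("worker-1")                                 # starts taskA and taskB
    allocation_store["worker-1"] = {"taskB", "taskC"}     # scheduler migrates taskA -> taskC
    reconcile("worker-1")                                 # only taskA stops; the process survives
    print(sorted(running))                                # ['taskB', 'taskC']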
Data races are among the most important concurrency anomalies in multi-threaded programs. Emerging constraint-based techniques have been applied to race detection and can find all the races that any other sound race detector can find. However, the constraint-based approach has serious limitations in helping programmers analyze and understand data races. First, it may report a large number of false positives because it does not recognize the dataflow propagation of the program. Second, it recommends a wide range of thread context switches to schedule a reported race (including false ones) whenever the race is exposed during the constraint-solving process. This ad hoc recommendation imposes too many context switches, which complicates data race analysis. To address these two limitations of state-of-the-art constraint-based race detection, this paper proposes DFTracker, an improved constraint-based race detector that recommends each data race with minimal thread context switches. Specifically, we reduce false positives by analyzing and tracking the dataflow in the program, which allows DFTracker to avoid unnecessary analysis of false race schedules. We further propose a novel algorithm to recommend an effective race schedule with minimal thread context switches for each data race. Our experimental results on real applications demonstrate that 1) without removing any true data race, DFTracker prunes false positives by 68% compared with the state-of-the-art constraint-based race detector; and 2) DFTracker recommends as few as 2.6–8.3 (4.7 on average) thread context switches per data race in real-world applications, which is 81.6% fewer context switches per data race than the state-of-the-art constraint-based race detector. Therefore, DFTracker can serve as an effective tool for programmers to understand data races.
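A small sketch of the metric being minimized: the number of thread context switches in a schedule that exposes a race. Given candidate interleavings that all trigger the same racy pair, the one with the fewest switches is easiest to reason about. The events and candidate schedules below are illustrative; the constraint solving that produces them is not shown.

    def context_switches(schedule):
        """schedule: list of thread ids in execution order; counts adjacent thread changes."""
        return sum(1 for a, b in zip(schedule, schedule[1:]) if a != b)

    def best_schedule(candidates):
        """Pick the race-exposing schedule with the fewest context switches."""
        return min(candidates, key=context_switches)

    # Two interleavings exposing the same race between threads T1 and T2.
    candidates = [
        ["T1", "T2", "T1", "T2", "T1"],    # 4 switches
        ["T1", "T1", "T1", "T2", "T2"],    # 1 switch, easier to understand and replay
    ]
    chosen = best_schedule(candidates)
    print(chosen, context_switches(chosen))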