Pub Date : 2015-04-13DOI: 10.1109/RTAS.2015.7108434
Qi Wang, Yuxin Ren, Matt Scaperoth, Gabriel Parmer
Multi- and many-core systems are increasingly prevalent in embedded systems. Additionally, isolation requirements between different partitions and criticalities are gaining in importance. This difficult combination is not well addressed by current software systems. Parallel systems require consistency guarantees on shared data-structures often provided by locks that use predictable resource sharing protocols. However, as the number of cores increase, even a single shared cache-line (e.g. for the lock) can cause significant interference. In this paper, we present a clean-slate design of the SPeCK kernel, the next generation of our COMPOSITE OS, that attempts to provide a strong version of scalable predictability - where predictability bounds made on a single core, remain constant with an increase in cores. Results show that, despite using a non-preemptive kernel, it has strong scalable predictability, low average-case overheads, and demonstrates better response-times than a state-of-the-art preemptive system.
{"title":"SPeCK: a kernel for scalable predictability","authors":"Qi Wang, Yuxin Ren, Matt Scaperoth, Gabriel Parmer","doi":"10.1109/RTAS.2015.7108434","DOIUrl":"https://doi.org/10.1109/RTAS.2015.7108434","url":null,"abstract":"Multi- and many-core systems are increasingly prevalent in embedded systems. Additionally, isolation requirements between different partitions and criticalities are gaining in importance. This difficult combination is not well addressed by current software systems. Parallel systems require consistency guarantees on shared data-structures often provided by locks that use predictable resource sharing protocols. However, as the number of cores increase, even a single shared cache-line (e.g. for the lock) can cause significant interference. In this paper, we present a clean-slate design of the SPeCK kernel, the next generation of our COMPOSITE OS, that attempts to provide a strong version of scalable predictability - where predictability bounds made on a single core, remain constant with an increase in cores. Results show that, despite using a non-preemptive kernel, it has strong scalable predictability, low average-case overheads, and demonstrates better response-times than a state-of-the-art preemptive system.","PeriodicalId":320300,"journal":{"name":"21st IEEE Real-Time and Embedded Technology and Applications Symposium","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131761816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-04-13DOI: 10.1109/RTAS.2015.7108416
Shen Li, Shaohan Hu, T. Abdelzaher
This paper develops new schedulability bounds for a simplified MapReduce workflow model. MapReduce is a distributed computing paradigm, deployed in industry for over a decade. Different from conventional multiprocessor platforms, MapReduce deployments usually span thousands of machines, and a MapReduce job may contain as many as tens of thousands of parallel segments. State-of-the-art MapReduce workflow schedulers operate in a best-effort fashion, but the need for real-time operation has grown with the emergence of real-time analytic applications. MapReduce workflow details can be captured by the generalized parallel task model from recent real-time literature. Under this model, the best-known result guarantees schedulability if the task set utilization stays below 50% of total capacity, and the deadline to critical path length ratio, which we call the stretch φ, surpasses 2. This paper improves this bound further by introducing a hierarchical scheduling scheme based on the novel notion of a Packing Server, inspired by servers for aperiodic tasks. The Packing Server consists of multiple periodically replenished budgets that can execute in parallel and that appear as independent tasks to the underlying scheduler. Hence, the original problem of scheduling MapReduce workflows reduces to that of scheduling independent tasks. We prove that the utilization bound for schedulability of MapReduce workflows is UB · φ-β/φ , where UB is the utilization bound of the underlying independent task scheduling policy, and β is a tunable parameter that controls the maximum individual budget utilization. By leveraging past schedulability results for independent tasks on multiprocessors, we improve schedulable utilization of DAG workflows above 50% of total capacity, when the number of processors is large and the largest server budget is (sufficiently) smaller than its deadline. This surpasses the best known bounds for the generalized parallel task model. Our evaluation using a Yahoo! MapReduce trace as well as a physical cluster of 46 machines confirms the validity of the new utilization bound for MapReduce workflows.
{"title":"The Packing Server for real-time scheduling of MapReduce workflows","authors":"Shen Li, Shaohan Hu, T. Abdelzaher","doi":"10.1109/RTAS.2015.7108416","DOIUrl":"https://doi.org/10.1109/RTAS.2015.7108416","url":null,"abstract":"This paper develops new schedulability bounds for a simplified MapReduce workflow model. MapReduce is a distributed computing paradigm, deployed in industry for over a decade. Different from conventional multiprocessor platforms, MapReduce deployments usually span thousands of machines, and a MapReduce job may contain as many as tens of thousands of parallel segments. State-of-the-art MapReduce workflow schedulers operate in a best-effort fashion, but the need for real-time operation has grown with the emergence of real-time analytic applications. MapReduce workflow details can be captured by the generalized parallel task model from recent real-time literature. Under this model, the best-known result guarantees schedulability if the task set utilization stays below 50% of total capacity, and the deadline to critical path length ratio, which we call the stretch φ, surpasses 2. This paper improves this bound further by introducing a hierarchical scheduling scheme based on the novel notion of a Packing Server, inspired by servers for aperiodic tasks. The Packing Server consists of multiple periodically replenished budgets that can execute in parallel and that appear as independent tasks to the underlying scheduler. Hence, the original problem of scheduling MapReduce workflows reduces to that of scheduling independent tasks. We prove that the utilization bound for schedulability of MapReduce workflows is UB · φ-β/φ , where UB is the utilization bound of the underlying independent task scheduling policy, and β is a tunable parameter that controls the maximum individual budget utilization. By leveraging past schedulability results for independent tasks on multiprocessors, we improve schedulable utilization of DAG workflows above 50% of total capacity, when the number of processors is large and the largest server budget is (sufficiently) smaller than its deadline. This surpasses the best known bounds for the generalized parallel task model. Our evaluation using a Yahoo! MapReduce trace as well as a physical cluster of 46 machines confirms the validity of the new utilization bound for MapReduce workflows.","PeriodicalId":320300,"journal":{"name":"21st IEEE Real-Time and Embedded Technology and Applications Symposium","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127649567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-04-13DOI: 10.1109/RTAS.2015.7108452
A. Alhammad, Saud Wasly, R. Pellizzoni
Current computing architectures are commonly built with multiple cores and a single shared main memory. Even though this architecture increases the overall computation power, main memory can easily become a bottleneck. Simultaneous access to main memory from multiple cores can cause both (1) severe degradation in performance and (2) unpredictable execution time for real-time applications. We propose in this paper to mitigate these two problems by co-scheduling cores as well as the main memory for predictable execution. In particular, we use a DMA component to overlap memory with computation for hiding the memory latency and therefore increasing the system performance. The main contribution of this paper is a novel global co-scheduling algorithm along with its associated schedulability analysis for sporadic hard real-time tasks. We evaluated our system by generating synthetic tasksets based on real benchmark parameters. The results show a significant improvement in system utilization while retaining a predictable system behavior.
{"title":"Memory efficient global scheduling of real-time tasks","authors":"A. Alhammad, Saud Wasly, R. Pellizzoni","doi":"10.1109/RTAS.2015.7108452","DOIUrl":"https://doi.org/10.1109/RTAS.2015.7108452","url":null,"abstract":"Current computing architectures are commonly built with multiple cores and a single shared main memory. Even though this architecture increases the overall computation power, main memory can easily become a bottleneck. Simultaneous access to main memory from multiple cores can cause both (1) severe degradation in performance and (2) unpredictable execution time for real-time applications. We propose in this paper to mitigate these two problems by co-scheduling cores as well as the main memory for predictable execution. In particular, we use a DMA component to overlap memory with computation for hiding the memory latency and therefore increasing the system performance. The main contribution of this paper is a novel global co-scheduling algorithm along with its associated schedulability analysis for sporadic hard real-time tasks. We evaluated our system by generating synthetic tasksets based on real benchmark parameters. The results show a significant improvement in system utilization while retaining a predictable system behavior.","PeriodicalId":320300,"journal":{"name":"21st IEEE Real-Time and Embedded Technology and Applications Symposium","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124346562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-04-13DOI: 10.1109/RTAS.2015.7108442
L. Sigrist, G. Giannopoulou, Pengcheng Huang, Andres Gomez, L. Thiele
Multicore systems are being increasingly used for embedded system deployments, even in safety-critical domains. Co-hosting applications of different criticality levels in the same platform requires sufficient isolation among them, which has given rise to the mixed-criticality scheduling problem and several recently proposed policies. Such policies typically employ runtime mechanisms to monitor task execution, detect exceptional events like task overruns, and react by switching scheduling mode. Implementing such mechanisms efficiently is crucial for any scheduler to detect runtime events and react in a timely manner, without compromising the system's safety. This paper investigates implementation alternatives for these mechanisms and empirically evaluates the effect of their runtime overhead on the schedulability of mixed-criticality applications. Specifically, we implement in user-space two state-of-the-art scheduling policies: the flexible time-triggered FTTS [1] and the partitioned EDFVD [2], and measure their runtime overheads on a 60-core Intel R Xeon Phi and a 4-core Intel R Core i5 for the first time. Based on extensive executions of synthetic task sets and an industrial avionic application, we show that these overheads cannot be neglected, esp. on massively multicore architectures, where they can incur a schedulability loss up to 97%. Evaluating runtime mechanisms early in the design phase and integrating their overheads into schedulability analysis seem therefore inevitable steps in the design of mixed-criticality systems. The need for verifiably bounded overheads motivates the development of novel timing-predictable architectures and runtime environments specifically targeted for mixed-criticality applications.
多核系统越来越多地用于嵌入式系统部署,甚至在安全关键领域。在同一平台上共同托管不同临界级别的应用程序需要充分的隔离,这就产生了混合临界调度问题和最近提出的一些策略。此类策略通常使用运行时机制来监视任务执行,检测任务超时等异常事件,并通过切换调度模式做出反应。对于任何调度器来说,有效地实现这种机制对于检测运行时事件并及时做出反应,而不损害系统的安全性至关重要。本文研究了这些机制的实现方案,并经验地评估了它们的运行时开销对混合临界应用程序的可调度性的影响。具体来说,我们在用户空间实现了两种最先进的调度策略:灵活的时间触发FTTS[1]和分区的EDFVD[2],并首次在60核Intel R Xeon Phi和4核Intel R Core i5上测量了它们的运行时开销。基于合成任务集的大量执行和工业航空电子应用,我们表明这些开销不能被忽视,特别是在大规模多核架构上,它们可能导致高达97%的可调度性损失。因此,在设计阶段早期评估运行时机制,并将其开销集成到可调度性分析中,似乎是混合临界系统设计中不可避免的步骤。对可验证的有限开销的需求促使开发专门针对混合临界应用程序的新型时间可预测架构和运行时环境。
{"title":"Mixed-criticality runtime mechanisms and evaluation on multicores","authors":"L. Sigrist, G. Giannopoulou, Pengcheng Huang, Andres Gomez, L. Thiele","doi":"10.1109/RTAS.2015.7108442","DOIUrl":"https://doi.org/10.1109/RTAS.2015.7108442","url":null,"abstract":"Multicore systems are being increasingly used for embedded system deployments, even in safety-critical domains. Co-hosting applications of different criticality levels in the same platform requires sufficient isolation among them, which has given rise to the mixed-criticality scheduling problem and several recently proposed policies. Such policies typically employ runtime mechanisms to monitor task execution, detect exceptional events like task overruns, and react by switching scheduling mode. Implementing such mechanisms efficiently is crucial for any scheduler to detect runtime events and react in a timely manner, without compromising the system's safety. This paper investigates implementation alternatives for these mechanisms and empirically evaluates the effect of their runtime overhead on the schedulability of mixed-criticality applications. Specifically, we implement in user-space two state-of-the-art scheduling policies: the flexible time-triggered FTTS [1] and the partitioned EDFVD [2], and measure their runtime overheads on a 60-core Intel R Xeon Phi and a 4-core Intel R Core i5 for the first time. Based on extensive executions of synthetic task sets and an industrial avionic application, we show that these overheads cannot be neglected, esp. on massively multicore architectures, where they can incur a schedulability loss up to 97%. Evaluating runtime mechanisms early in the design phase and integrating their overheads into schedulability analysis seem therefore inevitable steps in the design of mixed-criticality systems. The need for verifiably bounded overheads motivates the development of novel timing-predictable architectures and runtime environments specifically targeted for mixed-criticality applications.","PeriodicalId":320300,"journal":{"name":"21st IEEE Real-Time and Embedded Technology and Applications Symposium","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125393807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-04-13DOI: 10.1109/RTAS.2015.7108391
Shrinivas Anand Panchamukhi, F. Mueller
The translation look aside buffer (TLB) improves the performance of systems by caching the virtual page to physical frame mapping. But TLBs present a source of unpredictability for real-time systems. Standard heap allocated regions do not provide guarantees on the TLB set that will hold a particular page translation. This unpredictability can lead to TLB misses with a penalty of up to thousands of cycles and consequently intra- and inter-task interference resulting in loose bounds on the worst case execution time (WCET) and TLB-related preemption delay. In this work, we design and implement a new heap allocator that guarantees the TLB set, which will hold a particular page translation on a uniprocessor of a contemporary architecture. The allocator is based on the concept of page coloring, a software TLB partitioning method. Virtual pages are colored such that two pages of different color cannot map to the same TLB set. Our experimental evaluations confirm the unpredictability associated with the standard heap allocation. Using a set of synthetic and standard benchmarks, we show that our allocator provides task isolation for real-time tasks. To the best of our knowledge, such TLB isolation without special hardware support is unprecedented, increases TLB predictability and can facilitate WCET analysis.
{"title":"Providing task isolation via TLB coloring","authors":"Shrinivas Anand Panchamukhi, F. Mueller","doi":"10.1109/RTAS.2015.7108391","DOIUrl":"https://doi.org/10.1109/RTAS.2015.7108391","url":null,"abstract":"The translation look aside buffer (TLB) improves the performance of systems by caching the virtual page to physical frame mapping. But TLBs present a source of unpredictability for real-time systems. Standard heap allocated regions do not provide guarantees on the TLB set that will hold a particular page translation. This unpredictability can lead to TLB misses with a penalty of up to thousands of cycles and consequently intra- and inter-task interference resulting in loose bounds on the worst case execution time (WCET) and TLB-related preemption delay. In this work, we design and implement a new heap allocator that guarantees the TLB set, which will hold a particular page translation on a uniprocessor of a contemporary architecture. The allocator is based on the concept of page coloring, a software TLB partitioning method. Virtual pages are colored such that two pages of different color cannot map to the same TLB set. Our experimental evaluations confirm the unpredictability associated with the standard heap allocation. Using a set of synthetic and standard benchmarks, we show that our allocator provides task isolation for real-time tasks. To the best of our knowledge, such TLB isolation without special hardware support is unprecedented, increases TLB predictability and can facilitate WCET analysis.","PeriodicalId":320300,"journal":{"name":"21st IEEE Real-Time and Embedded Technology and Applications Symposium","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125091731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-04-13DOI: 10.1109/RTAS.2015.7108422
Patrick Lazik, N. Rajagopal, B. Sinopoli, Anthony G. Rowe
In this paper, we present the design and evaluation of a platform that can be used for time synchronization and indoor positioning of mobile devices. The platform uses the Time-Difference-Of-Arrival (TDOA) of multiple ultrasonic chirps broadcast from a network of beacons placed throughout the environment to find an initial location as well as synchronize a receiver's clock with the infrastructure. These chirps encode identification data and ranging information that can be used to compute the receiver's location. Once the clocks have been synchronized, the system can continue performing localization directly using Time-of-Flight (TOF) ranging as opposed to TDOA. This provides similar position accuracy with fewer beacons (for tens of minutes) until the mobile device clock drifts enough that a TDOA signal is once again required. Our hardware platform uses RF-based time synchronization to distribute clock synchronization from a subset of infrastructure beacons connected to a GPS source. Mobile devices use a novel time synchronization technique leverages the continuously free-running audio sampling subsystem of a smartphone to synchronize with global time. Once synchronized, each device can determine an accurate proximity from as little as one beacon using TOF measurements. This significantly decreases the number of beacons required to cover an indoor space and improves performance in the face of obstructions. We show through experiments that this approach outperforms the Network Time Protocol (NTP) on smartphones by an order of magnitude, providing an average 720μs synchronization accuracy with clock drift rates as low as 2ppm.
{"title":"Ultrasonic time synchronization and ranging on smartphones","authors":"Patrick Lazik, N. Rajagopal, B. Sinopoli, Anthony G. Rowe","doi":"10.1109/RTAS.2015.7108422","DOIUrl":"https://doi.org/10.1109/RTAS.2015.7108422","url":null,"abstract":"In this paper, we present the design and evaluation of a platform that can be used for time synchronization and indoor positioning of mobile devices. The platform uses the Time-Difference-Of-Arrival (TDOA) of multiple ultrasonic chirps broadcast from a network of beacons placed throughout the environment to find an initial location as well as synchronize a receiver's clock with the infrastructure. These chirps encode identification data and ranging information that can be used to compute the receiver's location. Once the clocks have been synchronized, the system can continue performing localization directly using Time-of-Flight (TOF) ranging as opposed to TDOA. This provides similar position accuracy with fewer beacons (for tens of minutes) until the mobile device clock drifts enough that a TDOA signal is once again required. Our hardware platform uses RF-based time synchronization to distribute clock synchronization from a subset of infrastructure beacons connected to a GPS source. Mobile devices use a novel time synchronization technique leverages the continuously free-running audio sampling subsystem of a smartphone to synchronize with global time. Once synchronized, each device can determine an accurate proximity from as little as one beacon using TOF measurements. This significantly decreases the number of beacons required to cover an indoor space and improves performance in the face of obstructions. We show through experiments that this approach outperforms the Network Time Protocol (NTP) on smartphones by an order of magnitude, providing an average 720μs synchronization accuracy with clock drift rates as low as 2ppm.","PeriodicalId":320300,"journal":{"name":"21st IEEE Real-Time and Embedded Technology and Applications Symposium","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122660956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-04-13DOI: 10.1109/RTAS.2015.7108435
Alexander Zuepke, Marc Bommert, D. Lohmann
This paper presents AUTOBEST, a united AUTOSAR-OS and ARINC 653 RTOS kernel that addresses the requirements of both automotive and avionics domains. We show that their domain-specific requirements have a common basis and can be implemented with a small partitioning microkernel-based design on embedded microcontrollers with memory protection (MPU) support. While both, AUTOSAR and ARINC 653, use a unified task model in the kernel, we address their differences in dedicated user space libraries. Based on the kernel abstractions of futexes and lazy priority switching, these libraries provide domain specific synchronization mechanisms. Our results show that thereby it is possible to get the best of both worlds: AUTOBEST combines avionics safety with the resource-efficiency known from automotive systems.
{"title":"AUTOBEST: a united AUTOSAR-OS and ARINC 653 kernel","authors":"Alexander Zuepke, Marc Bommert, D. Lohmann","doi":"10.1109/RTAS.2015.7108435","DOIUrl":"https://doi.org/10.1109/RTAS.2015.7108435","url":null,"abstract":"This paper presents AUTOBEST, a united AUTOSAR-OS and ARINC 653 RTOS kernel that addresses the requirements of both automotive and avionics domains. We show that their domain-specific requirements have a common basis and can be implemented with a small partitioning microkernel-based design on embedded microcontrollers with memory protection (MPU) support. While both, AUTOSAR and ARINC 653, use a unified task model in the kernel, we address their differences in dedicated user space libraries. Based on the kernel abstractions of futexes and lazy priority switching, these libraries provide domain specific synchronization mechanisms. Our results show that thereby it is possible to get the best of both worlds: AUTOBEST combines avionics safety with the resource-efficiency known from automotive systems.","PeriodicalId":320300,"journal":{"name":"21st IEEE Real-Time and Embedded Technology and Applications Symposium","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124157829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-04-13DOI: 10.1109/RTAS.2015.7108413
Zhenkai Zhang, X. Koutsoukos
In many multi-core architectures, inclusive shared caches are used to reduce cache coherence complexity. However, the enforcement of the inclusion property can cause invalidation of memory blocks at higher cache levels. In order to ensure safety, analysis of cache hierarchies with inclusive caches for worst-case execution time (WCET) estimation is typically based on conservative decisions. Thus, the estimation may not be tight. In order to tighten the estimation, this paper proposes an approach that can more precisely analyze the behavior of a cache hierarchy maintaining the inclusion property. We illustrate the approach in the context of multi-level instruction caches. The approach first analyzes all the inclusive caches in the hierarchy in a bottom-up direction, and then analyzes the remaining non-inclusive caches in a top-down direction. In order to capture the inclusion victims and their effects, we also propose a concept of aging barrier and integrate it with the traditional must and persistence analyses to safely slow down their aging process so as to derive more precise analyses. We evaluate the proposed approach on a set of benchmarks and the evaluation reveals that the estimations are tightened.
{"title":"Top-down and bottom-up multi-level cache analysis for WCET estimation","authors":"Zhenkai Zhang, X. Koutsoukos","doi":"10.1109/RTAS.2015.7108413","DOIUrl":"https://doi.org/10.1109/RTAS.2015.7108413","url":null,"abstract":"In many multi-core architectures, inclusive shared caches are used to reduce cache coherence complexity. However, the enforcement of the inclusion property can cause invalidation of memory blocks at higher cache levels. In order to ensure safety, analysis of cache hierarchies with inclusive caches for worst-case execution time (WCET) estimation is typically based on conservative decisions. Thus, the estimation may not be tight. In order to tighten the estimation, this paper proposes an approach that can more precisely analyze the behavior of a cache hierarchy maintaining the inclusion property. We illustrate the approach in the context of multi-level instruction caches. The approach first analyzes all the inclusive caches in the hierarchy in a bottom-up direction, and then analyzes the remaining non-inclusive caches in a top-down direction. In order to capture the inclusion victims and their effects, we also propose a concept of aging barrier and integrate it with the traditional must and persistence analyses to safely slow down their aging process so as to derive more precise analyses. We evaluate the proposed approach on a set of benchmarks and the evaluation reveals that the estimations are tightened.","PeriodicalId":320300,"journal":{"name":"21st IEEE Real-Time and Embedded Technology and Applications Symposium","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121533581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-04-13DOI: 10.1109/RTAS.2015.7108444
R. Pathan
This paper proposes a new preemptive scheduling algorithm, called Fixed-Priority with Priority Promotion (FPP), for scheduling sporadic tasks on uni- and multiprocessor platform. In FPP scheduling, tasks are executed similar to traditional fixed-priority (FP) scheduling but the priority of some tasks may be promoted at fixed time interval (called, promotion point) relative to the release time of each job. A policy called Increase Priority at Deadline Difference (IPDD) to compute the promotion points and promoted priorities for each task is proposed. It is shown that when all tasks' priorities are governed under IPDD policy, then FPP scheduling essentially prioritizes jobs according to Earliest-Deadline-First (EDF) priority. It is known that inserting and removing jobs to and from the ready queue of traditional EDF scheduler is more complex and has higher overhead than that of FP scheduler. To avoid such problem in FPP scheduling, a simple data structure and efficient operations to insert and remove jobs to and from the ready queue are proposed. Finally, an effective scheme to reduce overhead due to priority promotion is proposed: if a task set is not schedulable using traditional FP scheduling, then promotion points are assigned only to those tasks that need them to meet the deadlines; otherwise, tasks are assigned fixed priorities without any priority promotion and executed same as traditional FP scheduling. Empirical investigation shows the effectiveness of the proposed scheme in reducing overhead on uniprocessor and in accepting larger number of task sets in comparison to that of using state-of-the-art global schedulability tests for multiprocessors.
针对单处理器和多处理器平台上的零星任务调度问题,提出了一种新的抢占调度算法——固定优先级优先级提升算法(FPP)。在FPP调度中,任务的执行与传统的固定优先级调度类似,但相对于每个任务的释放时间,某些任务的优先级可以在固定的时间间隔(称为提升点)内提升。提出了一种名为IPDD (Increase Priority at Deadline Difference)的策略来计算每个任务的提升点和提升优先级。结果表明,当所有任务的优先级都在IPDD策略控制下时,FPP调度本质上是根据EDF优先级对作业进行优先级排序。众所周知,传统的EDF调度器在就绪队列中插入和删除作业比FP调度器更复杂,开销也更高。为了避免在FPP调度中出现这样的问题,提出了一种简单的数据结构和高效的作业插入和从就绪队列中移除操作。最后,提出了一种有效的降低优先级提升带来的开销的方案:如果使用传统的FP调度,任务集是不可调度的,那么提升点只分配给那些需要在截止日期前完成的任务;否则,任务被分配固定的优先级,没有任何优先级提升,并按照传统的FP调度方式执行。实证研究表明,与使用最先进的多处理器全局可调度性测试相比,所提出的方案在减少单处理器开销和接受更多任务集方面是有效的。
{"title":"Unifying fixed- and dynamic-priority scheduling based on priority promotion and an improved ready queue management technique","authors":"R. Pathan","doi":"10.1109/RTAS.2015.7108444","DOIUrl":"https://doi.org/10.1109/RTAS.2015.7108444","url":null,"abstract":"This paper proposes a new preemptive scheduling algorithm, called Fixed-Priority with Priority Promotion (FPP), for scheduling sporadic tasks on uni- and multiprocessor platform. In FPP scheduling, tasks are executed similar to traditional fixed-priority (FP) scheduling but the priority of some tasks may be promoted at fixed time interval (called, promotion point) relative to the release time of each job. A policy called Increase Priority at Deadline Difference (IPDD) to compute the promotion points and promoted priorities for each task is proposed. It is shown that when all tasks' priorities are governed under IPDD policy, then FPP scheduling essentially prioritizes jobs according to Earliest-Deadline-First (EDF) priority. It is known that inserting and removing jobs to and from the ready queue of traditional EDF scheduler is more complex and has higher overhead than that of FP scheduler. To avoid such problem in FPP scheduling, a simple data structure and efficient operations to insert and remove jobs to and from the ready queue are proposed. Finally, an effective scheme to reduce overhead due to priority promotion is proposed: if a task set is not schedulable using traditional FP scheduling, then promotion points are assigned only to those tasks that need them to meet the deadlines; otherwise, tasks are assigned fixed priorities without any priority promotion and executed same as traditional FP scheduling. Empirical investigation shows the effectiveness of the proposed scheme in reducing overhead on uniprocessor and in accepting larger number of task sets in comparison to that of using state-of-the-art global schedulability tests for multiprocessors.","PeriodicalId":320300,"journal":{"name":"21st IEEE Real-Time and Embedded Technology and Applications Symposium","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132731126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-04-13DOI: 10.1109/RTAS.2015.7108420
Husheng Zhou, G. Tong, Cong Liu
Graphics processing units (GPUs) are being widely used as co-processors in many application domains to accelerate general-purpose workloads that are computationally intensive, known as GPGPU computing. Real-time multi-tasking support is a critical requirement for many emerging GPGPU computing domains. However, due to the asynchronous and non-preemptive nature of GPU processing, in multi-tasking environments, tasks with higher priority may be blocked by lower priority tasks for a lengthy duration. This severely harms the system's timing predictability and is a serious impediment limiting the applicability of GPGPU in many real-time and embedded systems. In this paper, we present an efficient GPGPU preemptive execution system (GPES), which combines user-level and driverlevel runtime engines to reduce the pending time of high-priority GPGPU tasks that may be blocked by long-freezing low-priority competing workloads. GPES automatically slices a long-running kernel execution into multiple subkernel launches and splits data transaction into multiple chunks at user-level, then inserts preemption points between subkernel launches and memorycopy operations at driver-level. We implement a prototype of GPES, and use real-world benchmarks and case studies for evaluation. Experimental results demonstrate that GPES is able to reduce the pending time of high-priority tasks in a multitasking environment by up to 90% over the existing GPU driver solutions, while introducing small overheads.
{"title":"GPES: a preemptive execution system for GPGPU computing","authors":"Husheng Zhou, G. Tong, Cong Liu","doi":"10.1109/RTAS.2015.7108420","DOIUrl":"https://doi.org/10.1109/RTAS.2015.7108420","url":null,"abstract":"Graphics processing units (GPUs) are being widely used as co-processors in many application domains to accelerate general-purpose workloads that are computationally intensive, known as GPGPU computing. Real-time multi-tasking support is a critical requirement for many emerging GPGPU computing domains. However, due to the asynchronous and non-preemptive nature of GPU processing, in multi-tasking environments, tasks with higher priority may be blocked by lower priority tasks for a lengthy duration. This severely harms the system's timing predictability and is a serious impediment limiting the applicability of GPGPU in many real-time and embedded systems. In this paper, we present an efficient GPGPU preemptive execution system (GPES), which combines user-level and driverlevel runtime engines to reduce the pending time of high-priority GPGPU tasks that may be blocked by long-freezing low-priority competing workloads. GPES automatically slices a long-running kernel execution into multiple subkernel launches and splits data transaction into multiple chunks at user-level, then inserts preemption points between subkernel launches and memorycopy operations at driver-level. We implement a prototype of GPES, and use real-world benchmarks and case studies for evaluation. Experimental results demonstrate that GPES is able to reduce the pending time of high-priority tasks in a multitasking environment by up to 90% over the existing GPU driver solutions, while introducing small overheads.","PeriodicalId":320300,"journal":{"name":"21st IEEE Real-Time and Embedded Technology and Applications Symposium","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133403921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}