International Conference on Hardware/Software Codesign and System Synthesis: Latest Papers

Applying network calculus for performance analysis of self-similar traffic in on-chip networks
Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629497
Yue Qian, Zhonghai Lu, Wenhua Dou
On-chip traffic of many applications exhibits self-similar characteristics. In this paper, we apply network calculus to analyze the delay and backlog bounds for self-similar traffic in networks on chips. We first prove that self-similar traffic cannot be constrained by any deterministic arrival curve. Then we prove that self-similar traffic can be constrained by deterministic linear arrival curves α_{r,b}(t) = rt + b (r: rate, b: burstiness) if an additional parameter, the excess probability ε, is used to capture its burstiness exceeding the arrival envelope. This three-parameter model, ε-α_{r,b}(t) = rt + b(ε), enables us to apply and extend the results of network calculus to analyze the performance and buffering cost of networks delivering self-similar traffic flows. Assuming the latency-rate server model for the network elements, we give closed-form equations to compute the delay and backlog bounds for self-similar traffic traversing a series of network elements. Furthermore, we describe a performance analysis flow with self-similar traffic as input. Our experimental results using real on-chip multimedia traffic traces validate our model and approach.
Citations: 17
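As a rough illustration of the kind of closed-form bound the network-calculus entry above refers to, the sketch below computes the classical deterministic delay and backlog bounds for a flow with linear arrival curve α(t) = rt + b crossing a chain of latency-rate servers β_i(t) = R_i · max(0, t − T_i). It relies only on standard textbook results (servers in series concatenate to rate min R_i and latency ΣT_i; delay ≤ T + b/R; backlog ≤ b + rT) and does not reproduce the paper's own equations. In particular, the abstract does not say how the burstiness b depends on the excess probability ε, so b is simply an input here. All function and parameter names are illustrative assumptions.

```python
from typing import Iterable, List, Tuple


def concatenate(servers: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Combine latency-rate servers (R_i, T_i) connected in series.

    The min-plus convolution of latency-rate curves is again a
    latency-rate curve with rate min(R_i) and latency sum(T_i).
    """
    chain: List[Tuple[float, float]] = list(servers)
    if not chain:
        raise ValueError("need at least one server")
    rate = min(R for R, _ in chain)
    latency = sum(T for _, T in chain)
    return rate, latency


def bounds(r: float, b: float,
           servers: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """End-to-end (delay, backlog) bounds for arrival curve r*t + b."""
    R, T = concatenate(servers)
    if r > R:
        raise ValueError("arrival rate exceeds service rate; bounds are unbounded")
    delay = T + b / R      # horizontal deviation between the two curves
    backlog = b + r * T    # vertical deviation between the two curves
    return delay, backlog


if __name__ == "__main__":
    # Two hops given as (rate, latency); the numbers are made up.
    print(bounds(r=1.0, b=4.0, servers=[(2.0, 3.0), (4.0, 1.0)]))
```

In a probabilistic setting such as the one the abstract describes, one would presumably evaluate these bounds with a burstiness value chosen for a target ε, so that the result holds except with that excess probability; that step is not shown because the abstract does not specify it.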
FlexRay schedule optimization of the static segment
Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629485
M. Lukasiewycz, M. Glaß, J. Teich, Paul Milbredt
The FlexRay bus is the prospective automotive standard communication system. For the sake of high flexibility, the protocol includes a static time-triggered and a dynamic event-triggered segment. This paper is dedicated to the scheduling of the static segment in compliance with the automotive-specific AUTOSAR standard. For the determination of an optimal schedule in terms of the number of used slots, a fast greedy heuristic as well as a complete approach based on Integer Linear Programming are presented. For this purpose, a scheme for the transformation of the scheduling problem into a bin packing problem is proposed. Moreover, a metric and optimization method for the extensibility of partially used slots is introduced. Finally, the provided experimental results give evidence of the benefits of the proposed methods. On a realistic case study, the proposed methods are capable of obtaining better results in a significantly smaller amount of time compared to a commercial tool. Additionally, the experimental results provide a case study on incremental scheduling, a scalability analysis, an exploration use case, and an additional test case to emphasize the robustness and flexibility of the proposed methods.
Citations: 121
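The FlexRay entry above mentions a greedy heuristic applied after transforming the scheduling problem into bin packing. The abstract does not describe either the transformation or the heuristic, so the sketch below shows only a generic first-fit-decreasing bin packer as one example of what a fast greedy approach to bin packing can look like. Treating items as frame sizes and bins as slots of a fixed capacity is an assumption made here for illustration, not the authors' model.

```python
from typing import List, Sequence


def first_fit_decreasing(sizes: Sequence[int], capacity: int) -> List[List[int]]:
    """Pack items into as few bins as the greedy rule manages.

    Items are placed largest first, each into the first bin that still
    has room; a new bin is opened only when none fits.
    """
    bins: List[List[int]] = []   # contents of each bin
    free: List[int] = []         # remaining capacity of each bin
    for size in sorted(sizes, reverse=True):
        if size > capacity:
            raise ValueError(f"item of size {size} exceeds bin capacity {capacity}")
        for i, room in enumerate(free):
            if size <= room:
                bins[i].append(size)
                free[i] -= size
                break
        else:
            # No existing bin had room: open a new one.
            bins.append([size])
            free.append(capacity - size)
    return bins


if __name__ == "__main__":
    # Hypothetical frame sizes packed into slots of capacity 10.
    print(first_fit_decreasing([4, 8, 1, 4, 2, 1], capacity=10))
```

A heuristic of this kind gives no optimality guarantee in general, which is consistent with the abstract pairing its greedy method with a complete Integer Linear Programming approach.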
FRA: a flash-aware redundancy array of flash storage devices
Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629459
Yangsup Lee, Sanghyuk Jung, Y. Song
Since flash memory has many attractive characteristics such as high performance, non-volatility, low power consumption and shock resistance, it has been widely used as a storage medium in embedded and computer system environments. With respect to reliability, however, flash memory has several shortcomings: potentially high I/O latency due to erase-before-write and poor durability due to limited erase cycles. To overcome these problems, a RAID technique borrowed from storage technology based on hard disks is employed. With RAID, multi-bit burst failures in a page, block or device are easily detected and corrected, so that reliability can be significantly enhanced. However, the existing RAID-5 scheme for flash-based storage suffers from delayed response times caused by parity updates. To overcome this problem, we propose a novel approach using a RAID technique in flash storage, called Flash-aware Redundancy Array. In this approach, parity updates are postponed so that they are not included in the critical path of read and write operations. Instead, they are scheduled for when the device becomes idle. For example, the proposed scheme shows a 19% improvement in the average write response time, compared to other approaches.
Citations: 76
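The idea in the FRA entry above, taking parity updates off the write path and performing them when the device is idle, can be pictured with a toy in-memory model. This is a minimal sketch under strong assumptions: one XOR parity value per stripe, integers standing in for pages, and a hypothetical `on_idle` hook called by some scheduler. The class and method names are invented for illustration and do not come from the paper, which may handle stale parity quite differently.

```python
from functools import reduce
from operator import xor
from typing import List, Set


class DeferredParityArray:
    """Toy stripe array whose parity is refreshed lazily, not on every write."""

    def __init__(self, num_stripes: int, data_devices: int) -> None:
        self.data: List[List[int]] = [[0] * data_devices for _ in range(num_stripes)]
        self.parity: List[int] = [0] * num_stripes
        self.dirty: Set[int] = set()   # stripes whose parity is stale

    def write(self, stripe: int, device: int, value: int) -> None:
        # Critical path: store the data and remember that parity is stale.
        self.data[stripe][device] = value
        self.dirty.add(stripe)

    def on_idle(self) -> int:
        """Recompute parity for all stale stripes; returns how many were flushed."""
        flushed = len(self.dirty)
        for stripe in self.dirty:
            self.parity[stripe] = reduce(xor, self.data[stripe], 0)
        self.dirty.clear()
        return flushed

    def reconstruct(self, stripe: int, lost_device: int) -> int:
        """Rebuild one lost value from parity and the surviving values."""
        if stripe in self.dirty:
            raise RuntimeError("parity for this stripe is stale")
        survivors = (v for d, v in enumerate(self.data[stripe]) if d != lost_device)
        return reduce(xor, survivors, self.parity[stripe])


if __name__ == "__main__":
    arr = DeferredParityArray(num_stripes=2, data_devices=3)
    arr.write(0, 0, 0b1010)
    arr.write(0, 1, 0b0110)
    arr.write(0, 2, 0b0001)
    arr.on_idle()
    print(bin(arr.reconstruct(0, lost_device=1)))
```

The sketch also makes the trade-off visible: between a write and the next idle period the stripe is unprotected, so a real design needs some way to bound or cover that window. The abstract does not say how FRA does this.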
Using binary translation in event driven simulation for fast and flexible MPSoC simulation
Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629446
M. Gligor, Nicolas Fournel, F. Pétrot
In this paper, we investigate the use of instruction set simulators (ISS) based on binary translation to accelerate fully timed multiprocessor system simulation at transaction level. To obtain accurate timing behavior, we had to first solve timing issues in processor modeling, second define fast and precise cache models, and third solve the synchronization issues caused by the different models of computation used in the ISSes and in the rest of the system. We present an integration solution that covers these issues and detail its implementation. We have evaluated our proposal using processor models provided by the QEMU framework to replace the existing ISSes, and SystemC TLM as the simulation environment for the whole platform. This approach offers a range of solutions trading off simulation speed against accuracy. The experiments show that even for the most precise configuration, the simulation speedup is still significant.
Citations: 73
Scalable and retargetable simulation techniques for multiprocessor systems
Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629448
Heekyung Kim, Dukyoung Yun, S. Ha
For design space exploration of embedded systems, a virtual prototyping system is commonly used to verify the expected performance as well as functionality before a hardware prototype is built. For accurate performance estimation, a virtual prototyping system is constructed by replacing real processing components with component simulators running concurrently. In such a distributed simulation system, the overhead of communication and synchronization between the component simulators increases in proportion to the number of simulators when lock-step synchronization is used. As a result, the simulation performance degrades significantly as the number of processors integrated in a chip increases. To overcome this problem, we propose a scalable and retargetable simulation technique that boosts the simulation performance significantly by attaching a simulator wrapper to each component simulator. The simulator wrapper performs synchronization between the simulators and the simulation backplane on behalf of the associated simulator. Use of the simulator wrapper also makes the proposed simulation platform retargetable, since a third-party simulator like ARMulator can be integrated into the simulation environment through a wrapper without modification. In addition, it enables parallel simulation that achieves almost linear speed-up as the number of processor cores in the simulation host increases. Through experiments with a multimedia CODEC application and other applications, varying the number of processor simulators from 1 to 16, it is shown that the simulation performance remains constant. Scalable performance from parallel simulation is also confirmed by experiments.
Citations: 12
Supporting RTL flow compatibility in a microarchitecture-level design framework
Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629482
Daniel Schwartz-Narbonne, C. Chan, Yogesh S. Mahajan, S. Malik
Current RTL-based design methodologies face significant scaling challenges related to the difficulty of designing, modifying, and verifying RTL. RTL contains primarily low-level structural information about the design. In contrast, the microarchitecture level is much closer to the specification level, making it an effective entry point for hardware design. The explicit description of the high-level units of work is also beneficial for verification. Currently used models for high-level design have very complex semantics. In this paper, we present a microarchitectural modeling language with simpler semantics. We demonstrate that it results in a significantly simpler synthesis to Verilog, providing for integration with existing RTL flows. Moreover, the simple semantics of the model enable the generation of PSL assertions for functionally verifying the correctness of the synthesis. We demonstrate the efficacy of this approach through two case studies: a router switch and a processor design. We synthesized both designs, and formally verified the synthesis using the generated assertions.
Citations: 3
Automatic customization of device drivers for IP-cores used with assorted CPU organizations
Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629460
A. Acquaviva, N. Bombieri, F. Fummi, S. Vinco
Plugging an IP core into an embedded platform implies the generation of a device driver that complies with the IP communication protocol on one side and with the CPU organization (i.e., single processor, SMP, AMP) on the other. Reusing an existing driver developed for a different CPU organization requires time-consuming and error-prone manual customization, which discourages the evaluation of alternative target platform organizations. In this context, the paper first proposes to extract the formal model of the IP communication protocol from the RTL testbench provided with the IP core. Then a taxonomy of device drivers is presented based on the CPU organization of the platform. This taxonomy makes it possible to select the correct template for automatically generating a device driver that is compliant with the CPU organization, the use in a simulated or a real platform, the interrupt support, the operating system, the I/O architecture and the possible parallel execution. The proposed methodology has been successfully tested on a family of embedded platforms with different CPU organizations.
Citations: 4
Bottom-up performance analysis considering time slice based software scheduling at system level
Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629493
A. Viehl, M. Pressler, O. Bringmann
In this paper, a novel approach for integrating time slice based resource access in global performance analysis of distributed real-time critical embedded systems is presented. The performance analysis approach itself is based on bottom-up analysis of communicating processes under consideration of synchronization by inter-process communication and complex internal control flows of the processes. This general analysis methodology is extended concerning concurrent occupation of shared resources using time slice based access methods. The defined extensions are parameterizable for describing arbitrary communication media access schedules and software schedules on shared computation resources, although the explicit focus in this paper is on software scheduling. The applicability of the analysis extensions is presented by a case study of a multimedia subsystem implemented in SystemC.
Citations: 4
ESL power analysis of embedded processors for temperature and reliability estimations
Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629469
Björn Sander, Jürgen Schnerr, O. Bringmann
The ongoing scaling of CMOS technology facilitates the design of systems with continuously increasing functionality but also raises the susceptibility of these systems to reliability issues caused by high power densities and high temperatures. For reasons of complexity, the Electronic System Level (ESL) is gaining importance as a starting point of design. Design alternatives are evaluated at ESL with respect to several design objectives, lately also including temperature. But temperatures are dominated by local power effects, a fact that has not been sufficiently reflected at ESL until now. There is a lack of appropriate models, which we call the ESL Power Density Gap. The contributions of this paper are twofold. First, we describe why the ESL Power Density Gap should be closed. In doing so, we want to stimulate a discussion. After that, we introduce a new ESL methodology for the power analysis of embedded processors, which can be considered a first step toward solving the aforementioned problem. It allows the generation of executable system models from a platform description, combining a functionality representation and component characterizations. Using an example application, it is shown that high power densities, usually invisible at ESL, can be uncovered by applying the proposed approach.
{"title":"ESL power analysis of embedded processors for temperature and reliability estimations","authors":"Björn Sander, Jürgen Schnerr, O. Bringmann","doi":"10.1145/1629435.1629469","DOIUrl":"https://doi.org/10.1145/1629435.1629469","url":null,"abstract":"The ongoing scaling of CMOS technology facilitates the design of systems with continuously increasing functionality but also raises the susceptibility of these systems to reliability issues caused by high power densities and temperatures, respectively. Because of complexity reasons, the Electronic System Level (ESL) is gaining importance as starting point of design. Design alternatives are evaluated at ESL with respect to several design objectives, lately also including temperature. But temperatures are dominated by local power effects - a fact, that has not been sufficiently reflected at ESL until now. There is a lack of appropriate models, which we call ESL Power Density Gap. The contributions of this paper are twofold. First, we describe why the ESL Power Density Gap should be closed. In doing so, we want to stimulate a discussion. After that, we introduce a new ESL methodology for the power analysis of embedded processors, which can be considered as a first step to solve the aforementioned problem. It allows the generation of executable system models from a platform description, combining a functionality representation and component characterizations. 
Using an example application, it is shown that high power densities, usually invisible at ESL, can be uncovered by applying the proposed approach.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130520264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
MinDeg: a performance-guided replacement policy for run-time reconfigurable accelerators
Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629481
L. Bauer, M. Shafique, J. Henkel
Reconfigurable Processors use a reconfigurable fabric to implement application-specific accelerators and may perform run-time reconfigurations to exchange the set of deployed accelerators during application execution. Depending on the application requirements, the high utilization of the reconfigurable fabric (due to run-time reconfiguration) leads to a performance improvement compared to non-reconfigurable application-specific instruction-set processors (ASIPs). However, as the reconfiguration time of fine-grained reconfigurable fabrics (i.e., FPGA-like structures) is rather long (in the range of milliseconds), it is crucial to avoid frequent cycles of reconfiguration, replacement, and reconfiguration of the accelerators in order to exploit the real benefits of Reconfigurable Processors. Similar to memory caches, a replacement policy has to decide which reconfigurable accelerators shall be replaced in order to reconfigure additional accelerators. If a recently replaced accelerator is demanded again, the reconfiguration delay might noticeably increase the application execution time.
In this paper, we demonstrate that well-known policies for cache and page replacement (typically also used in state-of-the-art Reconfigurable Processors) are not generally suitable for replacing reconfigurable accelerators.
We therefore propose our novel performance-guided Minimum Degradation (MinDeg) replacement policy, which specifically targets Reconfigurable Processors and replaces reconfigurable accelerators at run time. It accounts for the performance penalty that occurs when a certain accelerator is replaced. Comparisons with the most prominent replacement policies show the superiority of our approach. We evaluate and compare MinDeg for a wide range of reconfiguration bandwidths and reconfigurable fabric sizes and achieve a speedup of up to 2.26x (1.74x compared to the widely used LRU policy). The overhead introduced to achieve this speedup is minor in comparison to the obtained application acceleration: the highest observed overhead (to calculate our MinDeg replacement policy) affected the obtained application acceleration by only 0.30%. A parallel hardware implementation of our MinDeg algorithm demands only 4,440 gate equivalents, which corresponds to 64% of the average requirements of one real-world reconfigurable accelerator (note: multiple accelerators are demanded per kernel). However, our MinDeg policy does not rely on hardware support, so a trade-off between the hardware requirements and the acceleration is possible.
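To make the contrast with LRU concrete, the following is a minimal sketch of what a performance-guided victim selection could look like. It is not the authors' MinDeg algorithm, which the abstract does not specify. The cost model (forecast executions multiplied by the cycles lost when falling back to a slower implementation) and all names and numbers (`Accelerator`, `expected_uses`, `sw_cycles`, `hw_cycles`) are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    last_used: int      # time step of the most recent use (input to LRU)
    expected_uses: int  # hypothetical forecast of upcoming executions
    sw_cycles: int      # hypothetical latency without the accelerator
    hw_cycles: int      # hypothetical latency with the accelerator


def lru_victim(loaded):
    """Classic LRU: evict the accelerator that was used longest ago."""
    return min(loaded, key=lambda a: a.last_used)


def degradation(a):
    """Assumed cost model: cycles lost if this accelerator is evicted."""
    return a.expected_uses * (a.sw_cycles - a.hw_cycles)


def min_degradation_victim(loaded):
    """Evict the accelerator whose removal costs the fewest cycles."""
    return min(loaded, key=degradation)


# Illustrative values only; they are not taken from the paper.
loaded = [
    Accelerator("dct", last_used=10, expected_uses=500, sw_cycles=400, hw_cycles=40),
    Accelerator("sad", last_used=90, expected_uses=20, sw_cycles=300, hw_cycles=30),
    Accelerator("filter", last_used=50, expected_uses=100, sw_cycles=200, hw_cycles=50),
]

print("LRU evicts:", lru_victim(loaded).name)
print("Min-degradation evicts:", min_degradation_victim(loaded).name)
```

With these made-up numbers, LRU evicts `dct` because it was used least recently, even though it is forecast to run most often. The degradation-based choice evicts `sad`, whose removal costs the fewest cycles under the assumed model.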
Citations: 8