Pub Date: 2019-10-01, DOI: 10.1109/MCSoC.2019.00027
Sarika Jain, Archana Patel
Identifying an event and all its attributes helps in responding promptly to emergencies and in making business decisions. Although accurate event identification has been studied over the last decade, less thought has been given to determining actions with context-dependent effects. This paper is motivated by the desire to develop a synergy between the different answers to the same query when it is posed by users of differing priority. The proposed approach exploits semantic technologies to model this personalized behavior. We provide a control protocol that recognizes the pattern in how precision changes as the priority of the user changes. The control protocol is used to define the priority of the user and is exploited in an efficient algorithm that yields good tradeoffs between various attributes of the decision. Both bottom-up and top-down parsing of the ontological knowledge base are described, depending on whether or not the event object is available in the knowledge base. The algorithm is then tested on a real-world use case of terrorist attack events. It returns answers of varying precision based on a balance between the available resources, the required certainty, the required specificity level, and the acceptable threshold value. The proposed control protocol and algorithm proved to be logically sound and appear to be a direct consequence of representing knowledge in a complete manner.
Title: Smart Ontology-Based Event Identification. Published in: 2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC).
Pub Date: 2019-10-01, DOI: 10.1109/MCSoC.2019.00017
Ahmed Kamaleldin, Muhammad Ali, P. Rad, Marcus Gottschalk, D. Göhringer
Current application domains, such as mobile robotics and the Internet of Things, require high computational power combined with low energy consumption. Therefore, MPSoCs are widely used as a platform for high-performance embedded computation. Recently, the emergence of the RISC-V instruction set architecture has driven SoC designers to adopt it in MPSoC designs as a cost-free, modular processor that is suitable for implementation on different hardware platforms. These characteristics also make RISC-V an interesting candidate for an FPGA soft-core processor. In this paper, we present a modular hybrid memory system for a lightweight RISC-V based MPSoC architecture. The hybrid memory consists of a global on-chip shared scratchpad memory, holding both instructions and data, for communication and synchronization between the processing elements (PEs), and a tightly coupled memory attached to each PE for low-latency access during private computation. Moreover, the complete MPSoC architecture is scalable and configurable in terms of the number of PEs, the shared and private memory sizes, and the number of memory-mapped peripherals. A benchmarking environment is developed to evaluate the proposed hybrid memory system in terms of memory access latency and memory bandwidth, and their impact on computation time. The complete MPSoC architecture is implemented and tested on a Xilinx Zynq 7000 FPGA device.
Title: Modular Memory System for RISC-V Based MPSoCs on Xilinx FPGAs
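To make the split between private tightly coupled memory and the shared scratchpad concrete, here is a minimal sketch of an address decoder and an average-latency estimate. The base addresses, sizes, and cycle counts are hypothetical values chosen for illustration; the abstract does not state the actual memory map or latencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    base: int
    size: int
    latency_cycles: int  # assumed access latency, not a measured value


def decode(address, regions):
    """Return the region that contains the address."""
    for region in regions:
        if region.base <= address < region.base + region.size:
            return region
    raise ValueError(f"unmapped address: {address:#x}")


def average_latency(addresses, regions):
    """Mean access latency in cycles over a list of accessed addresses."""
    return sum(decode(a, regions).latency_cycles for a in addresses) / len(addresses)


# Hypothetical memory map as seen by one processing element.
MEMORY_MAP = [
    Region("private_tcm", 0x0000_0000, 0x4000, 1),
    Region("shared_scratchpad", 0x1000_0000, 0x1_0000, 4),
]

# Two private accesses and two shared accesses.
trace = [0x0, 0x10, 0x1000_0000, 0x1000_0004]
print(average_latency(trace, MEMORY_MAP))  # 2.5
```

The more of a task's accesses fall in the private region, the closer the average gets to the private latency, which is the motivation the abstract gives for pairing each PE with its own memory.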
Pub Date: 2019-10-01, DOI: 10.1109/MCSoC.2019.00010
Mulya Agung, Muhammad Alfian Amrizal, Ryusuke Egawa, H. Takizawa
MPI process mapping is an important step in achieving scalable performance on non-uniform memory access (NUMA) systems. Conventional approaches have focused only on improving the locality of communication. However, related studies have shown that on modern NUMA systems, memory congestion can cause more severe performance degradation than poor locality, because the large number of processor cores in these systems can cause heavy congestion on shared caches and memory controllers. To optimize the process mapping, it is necessary to determine the communication behavior of the MPI processes. Previous methods rely on offline profiling to analyze the communication behavior, which incurs a high overhead and is potentially time-consuming. In this paper, we propose a method that automatically performs MPI process mapping, adapting to communication behavior while considering both locality and memory congestion. Our method works at runtime during the execution of an MPI application. It does not require modifications to the application, prior knowledge of the communication behavior, or changes to the hardware and operating system. The proposed method has been evaluated with the NAS parallel benchmarks on a NUMA system. Experimental results show that our method achieves performance close to that of an oracle-based mapping method, with low overhead on application execution. The performance improvement is up to 27.4% (13.4% on average) compared with the default mapping of the MPI runtime system.
Title: An Automatic MPI Process Mapping Method Considering Locality and Memory Congestion on NUMA Systems
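The trade-off between locality and memory congestion can be illustrated with a small greedy placement. This is a simplified sketch, not the authors' runtime method: the communication matrix, the per-process memory load, and the `congestion_weight` parameter are all assumptions made for the example.

```python
def map_processes(comm, mem_load, num_nodes, cores_per_node, congestion_weight=1.0):
    """Greedy sketch: place each process on the NUMA node that maximizes
    (communication with processes already on the node)
    minus congestion_weight * (memory load already on the node)."""
    n = len(comm)
    if n > num_nodes * cores_per_node:
        raise ValueError("not enough cores for all processes")
    placement = {}
    members = [[] for _ in range(num_nodes)]
    node_load = [0.0] * num_nodes
    # Place the heaviest memory users first so they tend to be spread out.
    order = sorted(range(n), key=lambda p: -mem_load[p])
    for p in order:
        best_node, best_score = None, None
        for node in range(num_nodes):
            if len(members[node]) >= cores_per_node:
                continue  # node has no free core
            locality = sum(comm[p][q] for q in members[node])
            score = locality - congestion_weight * node_load[node]
            if best_score is None or score > best_score:
                best_node, best_score = node, score
        placement[p] = best_node
        members[best_node].append(p)
        node_load[best_node] += mem_load[p]
    return placement


# Hypothetical input: processes 0-1 and 2-3 communicate heavily,
# and processes 0 and 2 are the memory-intensive ones.
comm = [
    [0, 10, 0, 0],
    [10, 0, 0, 0],
    [0, 0, 0, 10],
    [0, 0, 10, 0],
]
mem_load = [5, 1, 5, 1]
placement = map_processes(comm, mem_load, num_nodes=2, cores_per_node=2)
print(placement)  # {0: 0, 2: 1, 1: 0, 3: 1}
```

In this example the two memory-intensive processes land on different nodes, while each stays next to its main communication partner.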
Pub Date: 2019-10-01, DOI: 10.1109/MCSoC.2019.00031
Jie Hou, Qi Han, M. Radetzki
Rapidly increasing transistor density enables the evolution of many-core on-chip systems. Networks-on-Chip (NoCs) are the preferred communication infrastructure for such systems. Technology scaling increases the susceptibility of the NoC's components to failures. However, such a NoC can still operate at the cost of degraded performance. Therefore, it is not sufficient to analyze the performance and reliability of a NoC separately. In this paper, we propose a machine learning enabled performability evaluation framework that treats both aspects together. It applies Markov reward models. In addition, it leverages machine learning techniques to quickly obtain different performance metrics in the presence of faulty routers and under various simulation parameters, which is challenging to do analytically. We use a mesh-based NoC to demonstrate our methodology. The long-term performance of an 8x8 mesh under XY and fault-tolerant negative-first routing algorithms is evaluated.
Title: A Machine Learning Enabled Long-Term Performance Evaluation Framework for NoCs
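A Markov reward model attaches a performance value (the reward) to each reliability state and tracks the expected reward as the state distribution evolves. The sketch below uses a small discrete-time chain whose states count faulty routers. The transition probabilities and reward values are invented for illustration; in the framework described above, the per-state rewards would come from the learned performance predictor, and the actual model may be continuous-time.

```python
def step(dist, P):
    """Advance a state distribution by one time step: dist_next = dist * P."""
    n = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def expected_reward_over_time(P, rewards, initial, steps):
    """Expected reward at t = 0, 1, ..., steps for a discrete-time Markov reward model."""
    dist = list(initial)
    curve = []
    for _ in range(steps + 1):
        curve.append(sum(d * r for d, r in zip(dist, rewards)))
        dist = step(dist, P)
    return curve


# Hypothetical model: state k means k faulty routers; state 2 is absorbing.
P = [
    [0.9, 0.1, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
]
rewards = [1.0, 0.8, 0.5]  # assumed normalized throughput in each state

curve = expected_reward_over_time(P, rewards, [1.0, 0.0, 0.0], steps=500)
print(round(curve[1], 3), round(curve[2], 3))  # 0.98 0.959
```

Without repair, the expected reward decays toward the reward of the absorbing state (0.5 here), which is one way to read "long-term performance" under accumulating faults.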
Pub Date: 2019-10-01, DOI: 10.1109/MCSoC.2019.00012
Hayate Okuhara, Ryosuke Kazami, H. Amano
In this work, we present a low-overhead performance monitor that uses a delay chain to emulate the maximum operating frequency of a target system, enabling efficient adaptive voltage control. The proposed monitor can be built entirely from logic cells provided by general PDKs; thus, an automatic cell-based design flow can be used for its implementation. In addition, interconnect delay behavior can be imitated by exploiting automatically routed wires. To validate our concept, the proposed monitor is fabricated in a 65-nm Fully Depleted Silicon on Insulator (FD-SOI) technology. Real chip experiments reveal that the automated layout design achieves reasonable delay-emulation accuracy. When the maximum operating frequency of a CNN accelerator is emulated, the proposed system delay monitor (SDM) achieves a performance tracking error of several percent. Its power overhead is also only a few percent.
Title: A System Delay Monitor Exploiting Automatic Cell-Based Design Flow and Post-Silicon Calibration
Pub Date: 2019-10-01, DOI: 10.1109/MCSoC.2019.00051
Sayaka Terashima, Takuya Kojima, Hayate Okuhara, Kazusa Musha, H. Amano, Ryuichi Sakamoto, Masaaki Kondo, M. Namiki
A building block computing system using an inductive-coupling Through Chip Interface (TCI) consists of a 3-D stack of small dedicated chips. By changing the combination of stacked chips, various types of systems can be built. A MIPS R3000 compatible processor (GeyserTT), a neural network accelerator (SNACC), and a shared memory for building a twin tower of chips (SMTT) have been developed in a Renesas 65nm low-leakage CMOS process. Each chip includes the TCI IP (Intellectual Property), and an escalator network is built simply by stacking them. This paper presents evaluation results for each chip and estimates, by RTL simulation, the performance of stacking them. The performance of the single-tower and twin-tower configurations is estimated by RTL simulation with a part of AlexNet implemented. The evaluation results show that the single-tower configuration with GeyserTT+SNACC achieves about twice the performance of GeyserTT alone. Experiments on each real chip individually show that all of them operate at 50MHz or higher with extremely low power consumption. The twin-tower configuration achieves about 2x the performance of the single-tower configuration, that is, about 6x that of GeyserTT. The power consumption is about 276mW for the single tower and 496mW for the twin tower.
Title: A Preliminary Evaluation of Building Block Computing Systems
Pub Date: 2019-10-01, DOI: 10.1109/MCSoC.2019.00057
A. M. Gruzlikov, N. Kolesov, D. Kostygov, M. Tolmacheva
An approach to designing fault-tolerant and power-efficient multicore systems on chip for real-time information processing and control is proposed. It is assumed that the multicore system has reserve capacity on the chip, allowing for additional information processing. The approach is based on rules for introducing redundancy that aim to reduce power consumption, and on the principles of system-level fault diagnosis, which make it possible to decentralize system recovery in case of failure.
Title: A Real-Time Fault-Tolerant and Power-Efficient Multicore System on Chip
Pub Date: 2019-10-01, DOI: 10.1109/MCSoC.2019.00053
Keita Azegami, Kazusa Musha, Kazuei Hironaka, Akram Ben Ahmed, M. Koibuchi, Yao Hu, H. Amano
FPGAs are promising accelerators for MEC (Multi-access Edge Computing), which provides timing-critical services to many terminals from base stations near the edge. Although a high-end FPGA can support fixed-latency computation with relatively low power consumption, such devices are expensive, and the available acceleration circuits are limited to the size of a single FPGA. FiC (Flow-in-Cloud) has been developed to build a virtual large FPGA from a number of economical mid-range FPGAs connected by high-speed serial links. Although the current target of FiC is cloud computing, it is even more suitable for future MEC, because large hardware resources can be provided at low cost. One problem in using such multi-FPGA systems for timing-critical computation is network uncertainty. With common packet switching, computation speed is affected by network traffic. That is, the fixed-latency computation that a single FPGA can support is hard to support on multi-FPGA systems that use common packet-switching networks. To address this problem, we introduced an STDM (Static Time Division Multiplexing) switch into the FiC system. Since STDM always provides a constant communication latency, transfer time can be estimated beforehand. An implementation of the STDM switch on the FPGA board for FiC showed that the switch uses less than 14% of the LUTs. The required number of slots is less than 16 even for a system with 256 nodes. We implemented the Conjugate Gradient method, which includes all-to-all communication, on a 4x2 FiC system. It achieved a 17.9x performance improvement over a 6-core Intel E5-2667 2.90GHz CPU.
Title: A STDM (Static Time Division Multiplexing) Switch on a Multi-FPGA System
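The reason a static time-division switch gives predictable latency is that each flow owns a fixed slot in a repeating schedule, so the waiting time depends only on the arrival cycle and never on other traffic. The following is a minimal sketch of that idea for a single switch. The greedy slot assignment, the port numbering, and the one-cycle-per-slot timing are assumptions for illustration and are not taken from the FiC implementation.

```python
def build_slot_tables(flows, num_slots):
    """Assign each (in_port, out_port) flow to the first slot in which
    neither its input nor its output port is already in use."""
    tables = [dict() for _ in range(num_slots)]  # per slot: in_port -> out_port
    assignment = {}
    for in_port, out_port in flows:
        for slot, table in enumerate(tables):
            if in_port not in table and out_port not in table.values():
                table[in_port] = out_port
                assignment[(in_port, out_port)] = slot
                break
        else:
            raise ValueError(f"no free slot for flow {(in_port, out_port)}")
    return tables, assignment


def forward_cycle(arrival_cycle, slot, num_slots):
    """First cycle at or after arrival in which the flow's slot is active."""
    wait = (slot - arrival_cycle) % num_slots
    return arrival_cycle + wait


# Hypothetical flows through one switch with four time slots.
flows = [(0, 1), (0, 2), (1, 2), (2, 0)]
tables, assignment = build_slot_tables(flows, num_slots=4)
print(assignment)              # {(0, 1): 0, (0, 2): 1, (1, 2): 0, (2, 0): 0}
print(forward_cycle(6, 1, 4))  # 9
```

Under these assumptions a flit waits at most `num_slots - 1` cycles at a switch regardless of load, which is why the transfer time can be bounded before the application runs.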
Pub Date: 2019-10-01, DOI: 10.1109/MCSoC.2019.00058
Julien Durand, Y. Bouchebaba, L. Santinelli
Today's multi-core and many-core COTS platforms make a large amount of computational resources available to real-time applications. While they aim to increase performance for real-time applications, the challenge they pose is guaranteeing timing constraints. Real-time modeling and analysis thus face shared resources, optimization mechanisms, and sophisticated functionalities, which all combine into complex system dynamics that are extremely costly to characterize. This paper proposes a measurement-based approach and a statistical analysis for defining average and worst-case models of task execution under different possible execution conditions. The framework is formalized and then used to investigate different families of shared-resource interference effects occurring on multi-core platforms; such effects are quantified with statistical metrics applied to measurements of task execution times. The work focuses on effects due to shared memories with the NXP T4240 multi-core platform and the PikeOS hypervisor. A set of experiments is conducted to validate the proposed framework.
Title: Statistical Analysis for Shared Resources Effects with Multi-Core Real-Time Systems
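A measurement-based comparison of this kind can be sketched by summarizing execution-time samples taken in isolation and under contention, then taking the ratio of each statistic. The sample values below are made up, and the empirical maximum and nearest-rank 99th percentile are only crude stand-ins for a worst-case model; the abstract does not say which statistical metrics the authors use.

```python
import math
import statistics


def summarize(samples, quantile=0.99):
    """Average-case and empirical high-end statistics of execution times."""
    ordered = sorted(samples)
    # Nearest-rank quantile: smallest value with at least `quantile` of samples at or below it.
    idx = min(len(ordered) - 1, math.ceil(quantile * len(ordered)) - 1)
    return {
        "mean": statistics.mean(ordered),
        "high_quantile": ordered[idx],
        "max": ordered[-1],
    }


def interference_ratio(isolated, contended):
    """Slowdown of each statistic when the task runs alongside interfering tasks."""
    base, loaded = summarize(isolated), summarize(contended)
    return {key: loaded[key] / base[key] for key in base}


# Hypothetical execution times in microseconds.
isolated = [100, 102, 101, 99, 100]
contended = [120, 150, 130, 125, 180]

print(summarize(isolated)["mean"])  # 100.4
print(interference_ratio(isolated, contended)["max"])
```

Comparing the mean ratio with the maximum ratio shows whether interference mostly shifts the typical case or mainly stretches the tail, which matters for timing guarantees.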
Pub Date: 2019-10-01, DOI: 10.1109/MCSoC.2019.00045
Thanh Cong, François Charot
In this paper, we propose an approach for designing application-specific heterogeneous systems based on performance models that combine accelerator and processor core models. An application-specific program is profiled through its dynamic execution trace, which is used to construct a data flow model of the accelerator. Modeling of the processor is partitioned into an instruction set architecture (ISA) execution model and a micro-architecture-specific timing model. These models are implemented on FPGAs to take advantage of their parallelism and to speed up simulation as architecture complexity increases. The approach aims to ease the design of multi-core, multi-accelerator architectures and, by automating the design steps, helps to explore the design space. A case study confirms that the presented design flow can model an accelerator starting from an algorithm and validate its integration in a simulation framework, allowing performance to be estimated precisely. We also assess the performance of our RISC-V single-core and RISC-V-based heterogeneous architecture models.
Title: Designing Application-Specific Heterogeneous Architectures from Performance Models
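One common way to turn a dynamic execution trace into a data flow performance estimate is to let every operation start as soon as its inputs are ready and take the longest dependency chain as the execution time. The sketch below does this for a tiny trace. The trace format, the operation latencies, and the assumption of unlimited functional units are hypothetical and are not taken from the paper.

```python
def dataflow_critical_path(trace):
    """Estimate cycles for a trace executed as a data flow graph.

    trace: list of (dest, sources, latency) in program order.
    Each operation starts once all of its source values are available,
    so only true data dependencies constrain the schedule."""
    ready = {}  # value name -> cycle at which it becomes available
    total = 0
    for dest, sources, latency in trace:
        start = max((ready.get(src, 0) for src in sources), default=0)
        finish = start + latency
        ready[dest] = finish
        total = max(total, finish)
    return total


# Hypothetical trace of e = (a * b) + (a + 1).
trace = [
    ("a", [], 2),          # load a
    ("b", [], 2),          # load b
    ("c", ["a", "b"], 3),  # c = a * b
    ("d", ["a"], 1),       # d = a + 1
    ("e", ["c", "d"], 1),  # e = c + d
]

sequential_cycles = sum(latency for _, _, latency in trace)
print(dataflow_critical_path(trace), sequential_cycles)  # 6 9
```

The gap between the two numbers is the parallelism an accelerator could exploit in the ideal case; a resource-constrained model would fall somewhere between them.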