Fault Detection and Localization for Network-on-Chips in Mixed-Criticality Systems
Adele Maleki, Hamidreza Ahmadian, R. Obermaisser
2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00038
The increasing trend towards mixed-criticality systems, in which applications with different levels of criticality coexist and interact on the same platform, calls for fault-tolerant hardware platforms. At the same time, the performance demanded by such systems means that networks-on-chip are employed to interconnect the many computation resources. Consequently, detecting and localizing faults in the communication and computation resources becomes challenging when a large number of shared resources (e.g., routers, physical links) are used. This paper proposes a new hardware architecture for run-time fault detection and localization in mixed-criticality networks-on-chip. The proposed architecture detects transient and permanent faults in the network and distinguishes between faults of different resources. The fault detection and localization mechanisms have been evaluated using gem5 simulation and example scenarios.
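One common way to separate the two fault classes the paper targets is to count repeated detections on the same resource: a transient fault shows up once and vanishes, while a permanent fault recurs. The sketch below illustrates that idea only; the threshold and the counting scheme are illustrative assumptions, not the paper's mechanism.

```python
from collections import defaultdict

def classify_faults(events, threshold=3):
    """events: list of resource ids (e.g., router or link names) where a
    fault was detected. A resource reported `threshold` or more times is
    treated as permanently faulty; otherwise the fault is transient.
    The threshold value is an invented example parameter."""
    counts = defaultdict(int)
    for res in events:
        counts[res] += 1
    return {res: ("permanent" if n >= threshold else "transient")
            for res, n in counts.items()}

# Router r1 faults three times, link r2 only once.
print(classify_faults(["r1", "r2", "r1", "r1"]))
```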
Data-Driven Scenario-Based Application Mapping for Heterogeneous Many-Core Systems
J. Spieck, S. Wildermann, T. Schwarzer, J. Teich, M. Glaß
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00054
For applications whose workload and execution behavior vary significantly with the input, a single mapping of application tasks to a given target architecture is insufficient. A single mapping may deliver a high-quality solution for the average case but rarely exploits the specific execution behavior of the concurrent tasks triggered by each input tuple. For example, tasks with higher computational demands under certain inputs should be mapped onto the high-performance resources of the heterogeneous architecture. This calls for mappings that are specialized for specific input data. Yet, owing to the large number of input combinations, determining a separate optimized mapping for each individual input workload is not feasible for most applications. As a remedy, we propose to group input data with similar execution characteristics into a small number of so-called workload scenarios, for which we supply optimized mappings. In this paper, we provide a data-driven approach for detecting workload scenarios and exploring scenario-optimized mappings based on a collection of input data. The identification of scenarios and the determination of optimized mappings are interdependent: for the data-driven identification of workload scenarios, we have to measure execution profiles when running the application on the given input data under different mappings; however, to produce scenario-optimized mappings, the workload scenarios must already be known. We tackle this interdependence by proposing a cyclic design methodology that optimizes both aspects iteratively. We show that with our approach, the latency of two exemplary applications, a ray-tracing and an image-stitching application, can be significantly improved compared with methods that ignore workload scenarios or omit the proposed iterative refinement. Furthermore, we demonstrate that our proposal can be used in the context of a hybrid application mapping methodology.
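The scenario-identification half of the cycle amounts to clustering input profiles by execution cost; a plain k-means over one-dimensional profiles is enough to show the grouping step. In the full methodology the profiles would be re-measured under each scenario's optimized mapping and re-clustered; the profile values and k below are invented for illustration.

```python
def kmeans_1d(profiles, k=2, iters=10):
    """Cluster scalar execution-cost profiles into k workload scenarios.
    Returns (assignment, centroids); a toy stand-in for the data-driven
    scenario detection step."""
    centroids = profiles[:k]          # naive seeding for the sketch
    assign = []
    for _ in range(iters):
        # assign each input to the nearest scenario centroid
        assign = [min(range(k), key=lambda c: abs(p - centroids[c]))
                  for p in profiles]
        # move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(profiles, assign) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return assign, centroids

# Two cheap inputs and two expensive inputs fall into two scenarios.
assign, cents = kmeans_1d([1.0, 1.2, 9.8, 10.1], k=2)
print(assign)
```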
A Cloud Based Super-Optimization Method to Parallelize the Sequential Code's Nested Loops
Amin Majd, Mohammad Loni, Golnaz Sahebi, M. Daneshtalab, E. Troubitsyna
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00047
Advances in multi-core processor architectures have made parallel computing ubiquitous, but achieving maximum utilization of multi-core processors requires parallel programming techniques, and several challenges stand in the way. First, although recent parallel programming frameworks (e.g., MPI and OpenCL) assist developers, parallel programming remains unattractive to most programmers. Second, a massive volume of legacy software and applications has been written in serial form, and converting millions of lines of serial code to parallel code is highly time-consuming and demands a huge verification effort. Third, producing software and applications in parallel form is expensive, since it requires knowledge and expertise. Super-optimization, provided by super-compilers, is the process of automatically identifying dependent and independent instructions in order to find data dependencies and loop-free instruction sequences; the super-compiler then runs these instructions on different processors in parallel where possible. Super-optimization is thus a feasible way to relieve the programmer of the parallel programming workload. Since most of the complexity of sequential code lies in nested loops, we parallelize nested loops using the idea of super-optimization. One of the underlying stages of super-optimization is scheduling the tiled iteration space of nested loops; because this problem is NP-hard, traditional optimization methods are not feasible. In this paper, we propose a cloud-based super-optimization method, offered as Software-as-a-Service (SaaS), to reduce the cost of parallel programming and to increase the utilization of the processing capacity of multi-core processors. As a result, an intermediate programmer can exploit the full processing capacity of his or her system, without knowing anything about writing parallel code or about super-compiler internals, by sending serial code to a cloud server and receiving the parallel version of the code in return. An evolutionary algorithm is leveraged to solve the tile scheduling problem. The proposed super-optimization method is served as software under a hybrid (public and private) deployment model.
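Loop tiling, whose schedule the paper optimizes with an evolutionary algorithm, splits a nested iteration space into blocks that could then be dispatched to different cores. The sketch below only illustrates the tiling transformation itself (with an invented tile size), not the paper's scheduler; the key property is that tiling must preserve the result of the original loop nest.

```python
def tiled_sum(n, m, tile=4):
    """Compute sum(i*j) over an n-by-m iteration space, visiting it
    tile by tile instead of row by row. Each (ti, tj) block is an
    independent unit of work a scheduler could place on a core."""
    total = 0
    for ti in range(0, n, tile):                       # tile rows
        for tj in range(0, m, tile):                   # tile columns
            for i in range(ti, min(ti + tile, n)):     # inside one tile
                for j in range(tj, min(tj + tile, m)):
                    total += i * j
    return total

# Tiling must not change the result of the untiled loop nest.
assert tiled_sum(10, 10) == sum(i * j for i in range(10) for j in range(10))
print(tiled_sum(10, 10))  # → 2025
```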
Real-Time Attitude Estimation of Sigma-Point Kalman Filter via Matrix Operation Accelerator
Zeyang Dai, Lei Jing
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00055
Attitude estimation is an important part of navigation for mobile robots and of unmanned aerial vehicle (UAV) control. Although the Extended Kalman Filter (EKF) is the typical choice, the trend is to use the Sigma-Point Kalman Filter (SPKF) instead, owing to its higher accuracy and robustness in harsh environments. The main drawback of such a system is its higher computation cost. Many approaches based on Field Programmable Gate Arrays (FPGAs) have been proposed to accelerate such systems, but most are too application-specific: they are not reusable, and their design complexity carries a high price. Aiming for reusability, we present an IP core, a matrix operation accelerator, in this paper. We verify it on a Zynq-7020; the experimental results show that the proposed scheme reduces computing time by about 50% while also saving silicon area.
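The sigma-point idea behind the SPKF can be shown in one dimension: a few deterministically chosen points propagate mean and variance through a nonlinearity more faithfully than the EKF's linearization. The scaling below follows the standard unscented-transform form; the kappa value and the test function are illustrative, and the matrix square roots needed in higher dimensions are exactly the operations an accelerator like the paper's IP core would speed up.

```python
import math

def sigma_points_1d(mean, var, kappa=2.0):
    """Sigma points for a 1-D Gaussian: the mean plus a symmetric
    pair spread by sqrt((n + kappa) * var), with n = 1."""
    spread = math.sqrt((1 + kappa) * var)
    return [mean, mean + spread, mean - spread]

def unscented_mean(points, f, kappa=2.0):
    """Weighted mean of f over the sigma points (standard UT weights)."""
    w0 = kappa / (1 + kappa)        # weight of the central point
    wi = 1 / (2 * (1 + kappa))      # weight of each spread point
    return w0 * f(points[0]) + wi * f(points[1]) + wi * f(points[2])

# For x ~ N(0, 1) and f(x) = x^2, the true mean of f(x) is 1; the
# unscented transform recovers it, while linearizing at 0 gives 0.
pts = sigma_points_1d(0.0, 1.0)
print(unscented_mean(pts, lambda x: x * x))
```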
Fault-Tolerant Traffic-Aware Routing Algorithm for 3-D Photonic Networks-on-Chip
M. Meyer, Yu Wang, Takahiro Watanabe
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00032
As the number of cores on a single chip increased, the inter-core communication system quickly became the performance bottleneck. The Network-on-Chip (NoC) was proposed to solve the performance and scalability issues of bus-based systems, but it eventually met its own bottleneck, and several technologies have sprouted from NoC research. The most commonly researched upgrade is the 3D NoC, which uses stacked routers to reduce the maximum hop count. Other researchers have looked at alternative transmission media, such as photonics. These technologies can be combined for substantial performance and power benefits, but can be slowed by congestion in their path-setup phase. To address this issue, we propose a traffic-aware routing algorithm that evenly distributes traffic throughout the chip while simultaneously avoiding faulty nodes. The results show that the proposed algorithm successfully balances the load across the chip and that its performance costs are mostly offset by the benefit of fewer blocked paths.
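A minimal way to combine the two goals — traffic awareness and fault avoidance — is hop-by-hop minimal routing on a 2-D mesh that, at each router, picks the least-congested productive neighbor and skips faulty ones. This is a generic sketch under invented congestion values, not the paper's algorithm (which targets a 3-D photonic NoC and its path-setup phase).

```python
def route(src, dst, congestion, faulty):
    """Greedy minimal routing on a 2-D mesh: among neighbors that move
    toward dst, pick the one with the lowest congestion score, skipping
    faulty nodes. Returns the path, or None if every minimal hop is
    faulty (a real router would then misroute or drop)."""
    path = [src]
    x, y = src
    while (x, y) != dst:
        options = []
        if x != dst[0]:
            options.append((x + (1 if dst[0] > x else -1), y))
        if y != dst[1]:
            options.append((x, y + (1 if dst[1] > y else -1)))
        options = [n for n in options if n not in faulty]
        if not options:
            return None
        x, y = min(options, key=lambda n: congestion.get(n, 0))
        path.append((x, y))
    return path

# Node (1, 0) is faulty, so the packet detours through (0, 1).
cong = {(1, 0): 5, (0, 1): 1}
print(route((0, 0), (1, 1), cong, faulty={(1, 0)}))
```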
Exploiting Model-Level Parallelism in Recurrent Neural Network Accelerators
Lu Peng, Wentao Shi, Jian Zhang, Samuel Irving
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00042
Recurrent Neural Networks (RNNs) continue to facilitate rapid progress in a variety of academic and industrial fields, though their complexity continues to make efficient deployment difficult; when the RNN model size is not properly matched to hardware resources, performance can suffer from hardware under-utilization. In this work, we explore model-level parallelism for LSTM-RNN accelerators at different levels of the model using a multi-core design. The proposed multi-core design operates in three computing modes: multi-programming mode, in which independent models are executed; multithreading mode, in which parallelism among the layers of an LSTM model is exposed and properly scheduled; and helper-core mode, in which cores collaborate on a single LSTM layer, i.e., at a lower level of the model than in multithreading mode. Our design achieves up to a 1.98x speedup in multi-programming mode, a 1.91x speedup in multithreading mode, and a 1.88x speedup in helper-core mode over a single-core design.
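The multithreading mode can be pictured as pipelining an LSTM's layers across cores: layer l of time step t overlaps with layer l+1 of step t-1, so steady-state throughput is limited by the slowest stage. The back-of-the-envelope model below uses the textbook pipeline makespan formula with invented layer latencies; it is not the paper's scheduler or its measured numbers.

```python
def pipeline_makespan(layer_times, steps):
    """Ideal makespan when each layer runs on its own core: one full
    pass to fill the pipeline, then one slowest-stage latency per
    remaining time step."""
    return sum(layer_times) + (steps - 1) * max(layer_times)

layer_times = [3, 2, 4]               # invented per-layer latencies
steps = 8                             # invented sequence length
serial = sum(layer_times) * steps     # one core runs every layer serially
pipelined = pipeline_makespan(layer_times, steps)
print(serial, pipelined, round(serial / pipelined, 2))
```

With three unbalanced stages the model already shows why the speedup stays below the core count: the slowest layer dominates the steady state.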
Design-Time Memory Subsystem Optimization for Low-Power Multi-Core Embedded Systems
Manuel Strobel, M. Radetzki
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00056
Embedded multi-core systems are increasingly in use. As established single-core design methodologies are often not applicable out of the box, novel design-time optimization methods are required to manage real-time characteristics, predictability, and tight constraints on energy consumption and system performance. Focusing on the memory subsystem of a multi-core embedded system, this paper proposes an optimization workflow for the application-specific optimal binding of code and data to memory instances, the efficient handling and scheduling of available memory low-power modes, and the automated, transparent integration of the optimization results at the software level. The optimization algorithms are realized as integer linear programs; code modification and generation are implemented on the basis of LLVM. Experimental results for an ARM-based quad-core platform with an SRAM memory subsystem, consisting of core-local scratchpad memories and global shared memory, demonstrate the efficiency of our method in terms of energy consumption, both compared with a system using direct-mapped caches and compared with a state-of-the-art scratchpad mapping heuristic.
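The binding problem the paper solves exactly with an ILP — which code and data objects go into the small, cheap scratchpad and which stay in shared memory — can be approximated by a greedy knapsack over access density. The sketch below is that greedy stand-in, with invented object sizes and access counts; it shows the problem shape, not the paper's optimal formulation.

```python
def map_to_spm(objects, capacity):
    """objects: list of (name, size_bytes, access_count). Greedily fill
    the scratchpad with the objects that have the most accesses per
    byte; everything left over stays in shared memory."""
    chosen, used = [], 0
    for name, size, acc in sorted(objects, key=lambda o: o[2] / o[1],
                                  reverse=True):
        if used + size <= capacity:
            chosen.append(name)
            used += size
    return chosen

# Invented example: a hot small buffer beats a large lukewarm one.
objs = [("buf_a", 4, 100), ("buf_b", 8, 40), ("buf_c", 2, 90)]
print(map_to_spm(objs, capacity=8))
```

An ILP solves the same selection optimally (and jointly with low-power mode scheduling), which is why the paper's approach can beat heuristics like this one.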
Optimization of Numerous Small Dense-Matrix–Vector Multiplications in H-Matrix Arithmetic on GPU
S. Ohshima, I. Yamazaki, Akihiro Ida, Rio Yokota
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00009
Dense-matrix–vector multiplication is one of the well-known, important matrix calculations. It is provided as the general matrix–vector multiplication (GEMV) function in the basic linear algebra subprograms (BLAS) libraries available for various computing hardware. Traditionally, studies have focused on multiplying one large dense matrix (one whose sides are both long) by a vector. However, some applications instead require the acceleration of numerous small dense-matrix–vector multiplications, a capability provided by batched BLAS libraries; this calculation is also needed to compute hierarchical-matrix–vector multiplications. In this study, we implemented numerous small dense-matrix–vector multiplications on a Pascal GPU and evaluated the performance. We examined the impact of the optimization parameters and obtained better performance than previous work: the maximum improvement over our previous work is 28.47%, and over the batched GEMV of MAGMA BLAS it is up to 81.81%. Moreover, we considered using two optimization parameters in one GPU kernel, applying one parameter to some of the matrices and the other parameter to the rest. Although the gain from this was limited (up to 5%), a performance improvement was achieved. Our results will serve as a good reference for users who need to run numerous small dense-matrix–vector multiplications on a GPU and want to optimize matrix–vector multiplication through hand-tuning and auto-tuning.
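Batched GEMV is simply many independent y = A·x products evaluated together; on a GPU, the tuning parameters the paper studies govern how matrices and rows are assigned to thread blocks. The plain-Python reference below defines the operation being accelerated (with tiny invented matrices), not the GPU kernel itself.

```python
def batched_gemv(batch):
    """batch: list of (A, x) pairs, A given as a list of rows.
    Returns the list of products y = A @ x, one per pair — the
    operation a batched GEMV kernel computes in parallel."""
    return [[sum(a * b for a, b in zip(row, x)) for row in A]
            for A, x in batch]

A1 = [[1, 2], [3, 4]]
A2 = [[0, 1], [1, 0]]
print(batched_gemv([(A1, [1, 1]), (A2, [5, 7])]))  # → [[3, 7], [7, 5]]
```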
Algorithm to Determine Extended Edit Distance between Program Codes
Kazuki Anzai, Y. Watanobe
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00033
An algorithm to determine the extended edit distance between program codes is presented. In addition to the conventional Levenshtein distance, the extended edit distance considers common operations on program code in order to find similar programs more accurately. To calculate the distance, the algorithm employs dynamic programming techniques as well as an algorithm for solving the minimum-cost flow problem on a bipartite graph. In this paper, details of the algorithm and experimental results are presented. The experiments were conducted with source code submitted to an online judge system, which stores numerous source codes for each programming problem. The results show that the proposed algorithm can find, with higher probability, similar source code that cannot be found using the conventional Levenshtein distance.
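The conventional Levenshtein distance that the extended metric builds on is the classic dynamic program below. The paper's extension additionally accounts for common code edits (via the minimum-cost flow matching), which this baseline does not attempt; the code snippets compared are invented examples.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions turning string a into string b, computed row by row
    with O(len(b)) memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

# Two code lines differing only in an identifier and a literal.
print(levenshtein("int x = 0;", "int y = 1;"))  # → 2
```

A limitation the extended metric addresses: renaming a variable throughout a file inflates the plain Levenshtein distance in proportion to how often the name occurs, even though the programs are essentially the same.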
Automatic Generation of Fill-in-the-Blank Programming Problems
Kenta Terada, Y. Watanobe
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00034
When solving programming problems, it is difficult for beginners to create program code from scratch. One way to ease this difficulty is to provide them with programming problems in a fill-in-the-blank format. In this work, we propose a method to automatically generate such problems, with two key constituents: selection of exemplary source code and selection of the places to be blanked out. For selecting exemplary source code, we propose k-means clustering with silhouette analysis over code in an Online Judge System (OJ). For selecting the places to be blanked, we propose a model based on a bidirectional Long Short-Term Memory network (Bi-LSTM) with a sequential Conditional Random Field (CRF). We discuss the evaluation of the proposed approach in terms of how well fill-in-the-blank programming problems are generated.
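Once per-token "blank-worthiness" scores exist (the paper predicts them with the Bi-LSTM + CRF model), turning a code snippet into a fill-in-the-blank problem is a masking step. The toy below shows only that final step; the token scores and threshold are invented, standing in for the model's output.

```python
def make_fill_in_blank(tokens, scores, threshold=0.5):
    """Replace every token whose score exceeds the threshold with a
    blank marker; scores would come from the trained sequence model."""
    return [" ____ " if s > threshold else t
            for t, s in zip(tokens, scores)]

tokens = ["for", "(", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")"]
scores = [0.1, 0.0, 0.2, 0.1, 0.9, 0.0, 0.2, 0.8, 0.3, 0.0, 0.2, 0.9, 0.0]
print("".join(make_fill_in_blank(tokens, scores)))
```

Here the initializer, the comparison operator, and the increment are blanked out — the pieces a learner is asked to supply.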