Software clone research is of high relevance for software engineering research and practice. Software clones are often a result of copying and pasting as an act of ad-hoc reuse by programmers, and can occur at many levels, from simple statement sequences to blocks, methods, classes, source files, subsystems, models, architectures and entire designs, and in all software artifacts (code, models, requirements or architecture documentation, etc.). While sometimes clones have a demonstrably bad influence on code quality, other studies have shown they can have beneficial effects on the code if used carefully. In this workshop, we seek to discuss new and active results from the research community. In particular, IWSC aims to bring together researchers and practitioners to evaluate the current state of research, discuss common problems, discover opportunities for collaboration, exchange ideas, and explore synergies with similarity analysis in other areas and disciplines.
{"title":"Message from the Chairs","authors":"Hitesh Sajnani, Chaiyong Ragkhitwetsagul, Manishankar Mondal","doi":"10.1109/mcsoc.2019.00005","DOIUrl":"https://doi.org/10.1109/mcsoc.2019.00005","url":null,"abstract":"Software clone research is of high relevance for software engineering research and practice. Software clones are often a result of copying and pasting as an act of ad-hoc reuse by programmers, and can occur at many levels, from simple statement sequences to blocks, methods, classes, source files, subsystems, models, architectures and entire designs, and in all software artifacts (code, models, requirements or architecture documentation, etc.). While sometimes clones have a demonstrably bad influence on code quality, other studies have shown they can have beneficial effects on the code if used carefully. In this workshop, we seek to discuss new and active results from the research community. In particular, IWSC aims to bring together researchers and practitioners to evaluate the current state of research, discuss common problems, discover opportunities for collaboration, exchange ideas, and explore synergies with similarity analysis in other areas and disciplines.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131837140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00040
Yaoying Luo, M. Meyer, Xin Jiang, Takahiro Watanabe
Network-on-Chip (NoC) is widely accepted as an advanced on-chip interconnect that replaces the traditional bus structure. NoC is a promising solution for future many-core processors, offering better scalability and flexibility. Routers in a NoC make routing decisions based on the routing algorithm, and many routing algorithms have been proposed to improve NoC performance. Some routing algorithms perform well only under a specific traffic pattern and poorly under others. Compared to uniform traffic, complex hotspot patterns are closer to real workloads. Traffic-aware routing algorithms are designed to address this problem, but they commonly rely on virtual channels (VCs) or routing tables to predict the future traffic distribution, which incurs power and hardware overheads that cannot be ignored. To solve these problems, this paper proposes a VC-free traffic-pattern-aware routing algorithm based on West-first and North-last routing. The algorithm contains a mechanism for detecting hotspot nodes and hotspot patterns, designed to improve NoC performance under different traffic patterns. A low-cost hotspot information block is attached to each router to process hotspot information and detect hotspot patterns. Simulation results show that the proposed routing algorithm combines the advantages of the two existing routing algorithms and performs better across different traffic patterns.
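The turn-model routing the paper builds on can be illustrated with a small sketch. The function below shows only the West-first half of the idea in plain Python; the coordinate convention, direction names, and the function itself are illustrative assumptions, not the paper's hotspot-aware algorithm:

```python
def west_first_route(cur, dst):
    """Return the output directions permitted by West-first turn-model
    routing on a 2D mesh (x grows eastward, y grows northward).
    Illustrative sketch only, not the paper's hotspot-aware algorithm."""
    cx, cy = cur
    dx, dy = dst
    if dx < cx:          # destination lies to the west:
        return ["W"]     # all westward hops must be taken first
    allowed = []         # otherwise route adaptively among E/N/S
    if dx > cx:
        allowed.append("E")
    if dy > cy:
        allowed.append("N")
    if dy < cy:
        allowed.append("S")
    return allowed or ["LOCAL"]   # already at the destination router
```

West-first forbids the two turns into the west direction, which is what makes this partially adaptive scheme deadlock-free; North-last restricts turns out of the north direction analogously.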
{"title":"A Hotspot-Pattern-Aware Routing Algorithm for Networks-on-Chip","authors":"Yaoying Luo, M. Meyer, Xin Jiang, Takahiro Watanabe","doi":"10.1109/MCSoC.2019.00040","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00040","url":null,"abstract":"The Networks-on-Chip (NoC) is widely accepted as an advanced on-chip system which replaces the traditional bus structure. NoC is promising as a solution for future many-core chip processor with better scalability and flexibility. Routers in NoC make the routing decision based on the routing algorithm. Many routing algorithms have been proposed to improve the performance of NoC. Some routing algorithms only have superiority under a specific traffic pattern, but they can have poor performance under other traffic patterns. Compared to uniform traffic, some complex hotspot patterns are closer to reality. Traffic-aware routing algorithms are designed to solve this problem. These traffic-aware routing algorithms commonly utilize virtual channels (VC) or routing tables to predict the future traffic distribution, which will have large power and hardware overheads that cannot be ignored. To solve these problems, a VC-free traffic-pattern-aware routing algorithm based on West-first routing and North-last routing is proposed in this paper. This algorithm contains a hotspot node and hotspot pattern detecting mechanism, which were designed to improve the performance of NoCs under different traffic patterns. A hotspot information block which has a small cost is connected to each router to deal with the hotspot information and detect the hotspot patterns. The simulation results show that routing algorithm proposed combines the advantages of the two existing routing algorithms and has better performance when considering different traffic patterns.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127627072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00011
Tomohiro Suzuki
Task-parallel algorithms have attracted attention in recent years as algorithms for highly parallel architectures. Their aim is to keep all computing resources busy, without stalling, by executing a large number of fine-grained tasks asynchronously while observing data dependencies. The tile algorithm for the decomposition of dense matrices is implemented using a task-parallel programming model following this approach. In this article, we consider how to select the tile size, which is an important performance parameter.
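To make the role of the tile size concrete, here is a minimal pure-Python sketch of a tiled computation. A tiled matrix multiply is used instead of a decomposition for brevity; the function names and the parameter `nb` are illustrative, not the paper's:

```python
def tiles(n, nb):
    """Split range(n) into contiguous tiles of size nb (last may be smaller)."""
    return [range(i, min(i + nb, n)) for i in range(0, n, nb)]

def tiled_matmul(A, B, nb):
    """C = A @ B computed tile by tile. Each (bi, bj, bk) body is one
    fine-grained unit of work that a task-parallel runtime could submit
    as a task; nb trades task granularity against scheduling overhead."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for bi in tiles(n, nb):
        for bj in tiles(n, nb):
            for bk in tiles(n, nb):
                # one task: accumulate the (bi, bj) tile of C
                for i in bi:
                    for j in bj:
                        C[i][j] += sum(A[i][k] * B[k][j] for k in bk)
    return C
```

A small `nb` exposes more tasks (better load balance) but raises per-task overhead; a large `nb` does the opposite, which is exactly the tuning trade-off the article studies.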
{"title":"Performance Tuning of Tile Matrix Decomposition","authors":"Tomohiro Suzuki","doi":"10.1109/MCSoC.2019.00011","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00011","url":null,"abstract":"Task parallel algorithms have attracted attention as algorithms for highly parallel architectures in recent years. The aim of such algorithms is to keep all computing resources running without stalling by executing a large number of fine-grained tasks asynchronously while observing data dependencies. The tile algorithm of matrix decomposition of dense matrices is implemented using a task parallel programming model following such an approach. In this article, we will consider how to select tile size, which is an important performance parameter.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132751342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00048
Aki Nakamura, Y. Okuyama, R. Oka
Air-drawn character recognition is an input method based on human body movements. Time-Space Continuous Dynamic Programming (TSCDP) is an algorithm that can realize this task by detecting pre-defined trajectories in input videos. Since TSCDP requires massive computation, it is hard to make the system work in real time on a single processor. In this paper, we investigated the frames-per-second (fps) requirements for an air-drawn character recognition system using TSCDP. We analyzed the dependencies among the calculations of TSCDP to parallelize them on GPUs, and evaluated the computation time on CPUs and GPUs in both desktop and embedded environments. By comparing the measured throughput with the fps requirements, we confirmed that the proposed system works in real time on real videos in both environments.
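The spotting idea behind Continuous DP can be sketched in one dimension. The recurrence below is a simplified illustration only: the transition set and the normalization are assumptions made for this sketch, not the TSCDP formulation used in the paper. The key property is that the start point is left free, so a matching template can end at any input time:

```python
def continuous_dp(stream, template):
    """For every input time t, return the best normalized accumulated
    distance of the template ending at t, with a free start point
    (simplified 1-D sketch of the Continuous DP spotting idea)."""
    INF = float("inf")
    J = len(template)
    prev = [INF] * J
    scores = []
    for x in stream:
        cur = [INF] * J
        for j, r in enumerate(template):
            d = abs(x - r)                     # local distance
            if j == 0:
                cur[j] = d                     # matching may start anywhere
            else:
                # stay, advance from previous time, or advance in place
                cur[j] = d + min(prev[j], prev[j - 1], cur[j - 1])
        scores.append(cur[-1] / J)             # length-normalized end score
        prev = cur
    return scores
```

A low score at time t means the template was "spotted" ending at t; TSCDP extends this recurrence to two spatial dimensions plus time, which is where the massive computation comes from.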
{"title":"Real-Time Implementation of Time-Space Continuous Dynamic Programming for Air-Drawn Character Recognition Using GPUs","authors":"Aki Nakamura, Y. Okuyama, R. Oka","doi":"10.1109/MCSoC.2019.00048","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00048","url":null,"abstract":"Air-drawn character recognition is one of the input methods using human body movements. Time-Space Continuous Dynamic Programming (TSCDP) is one of the algorithms that can implement such a task by detecting pre-defined trajectories from input videos. Since TSCDP requires massive computation, it is hard to make the system work in real-time with a single processor. In this paper, we investigated the frames per second (fps) requirements for the air-drawn character recognition system using TSCDP. We analyzed the dependencies among the calculations of TSCDP for the parallelization using GPUs. We evaluated the computation time with CPU and GPU for desktop and embedded environments. We confirmed that the proposed system works in real-time for real videos in both desktop and embedded environments by comparing with the fps requirements.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128217192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00059
Eugene Yip, Erjola Lalo, Gerald Lüttgen, A. Sailer
The automotive industry is confronting the multi-core challenge, where legacy and modern software must run correctly and efficiently in parallel, by designing its software around the Logical Execution Time (LET) model. While such designs offer implementations that are platform-independent and time-predictable, task communications are assumed to complete instantaneously. Thus, it is critical to implement timely data transfers between LET tasks, which may be on different cores, in order to preserve a design's data flow. In this paper, we develop a lightweight Static Buffering Protocol (SBP) that satisfies the LET communication semantics and supports signal-based communication with multiple signal writers. Our simulation-based evaluation with realistic industrial automotive benchmarks shows that the execution overhead of SBP is at most half that of the traditional Point-To-Point (PTP) communication method. Moreover, SBP needs on average 60% less buffer memory than PTP.
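The LET communication semantics that any such protocol must preserve can be shown with a toy channel: a value written during a task's LET interval becomes visible to readers only at the logical release point, so readers see timing-deterministic data regardless of when the writer physically finished. This is a minimal sketch of the semantics, not the paper's SBP; the class and method names are invented for illustration:

```python
class LetChannel:
    """Toy double-buffered channel with LET visibility semantics."""

    def __init__(self, init=0):
        self.published = init   # value visible to readers (logical state)
        self.pending = init     # value produced but not yet released

    def write(self, value):
        """Physical write, at any point inside the writer's LET interval."""
        self.pending = value

    def publish(self):
        """Logical release at the writer's deadline: make the value visible."""
        self.published = self.pending

    def read(self):
        """Logical read at reader activation: always the released value."""
        return self.published
```

The engineering problem the paper addresses is doing this release/activation copying cheaply (and with multiple writers) across cores; the toy above only fixes the observable semantics.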
{"title":"Lightweight Semantics-Preserving Communication for Real-Time Automotive Software","authors":"Eugene Yip, Erjola Lalo, Gerald Lüttgen, A. Sailer","doi":"10.1109/MCSoC.2019.00059","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00059","url":null,"abstract":"The automotive industry is confronting the multi-core challenge, where legacy and modern software must run correctly and efficiently in parallel, by designing their software around the Logical Execution Time (LET) model. While such designs offer implementations that are platform independent and time predictable, task communications are assumed to complete instantaneously. Thus, it is critical to implement timely data transfers between LET tasks, which may be on different cores, in order to preserve a design's data-flow. In this paper, we develop a lightweight Static Buffering Protocol (SBP) that satisfies the LET communication semantics and supports signal-based communication with multiple signal writers. Our simulation-based evaluation with realistic industrial automotive benchmarks shows that the execution overhead of SBP is at most half that of the traditional Point-To-Point (PTP) communication method. Moreover, SBP needs on average 60% less buffer memory than PTP.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131022185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Soft processors are becoming a common component in reconfigurable computing platforms such as FPGAs. For some accelerators, custom logic functions are implemented as processing elements alongside the soft processor. Since the resources in an FPGA are fixed and limited, it is desirable to implement the soft processor with as few logic resources as possible. One important part of the processor is the instruction fetch unit, whose performance depends on branch prediction. Conventional branch predictors such as bimodal and gshare are simple to implement, but their prediction accuracy is not good enough. The TAGE branch predictor, on the other hand, has better prediction accuracy but contains a complex logic path for branch prediction, which results in a lower operating frequency. In this paper, we propose a branch predictor called pTAGE, which has almost the same prediction accuracy as TAGE while avoiding becoming the critical path of the processor. The branch prediction of pTAGE is pipelined, so a prediction result is available every clock cycle. We implement gshare, TAGE, and pTAGE in Verilog HDL and evaluate their operating frequency and prediction rate on an FPGA. The results show that pTAGE has almost the same prediction rate as TAGE and a 1.41 times higher operating frequency. We also evaluate performance while varying the latency for updating branch predictions, and the results show that pTAGE outperforms gshare in deeply pipelined processors.
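For readers unfamiliar with the gshare baseline, it is small enough to sketch in full: the global branch history XORed with the PC indexes a table of 2-bit saturating counters. TAGE and pTAGE add tagged tables with geometrically growing history lengths and are not reproduced here; the parameter choices below are illustrative:

```python
class Gshare:
    """Minimal gshare branch predictor with 2-bit saturating counters."""

    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.table = [1] * (1 << bits)  # counters start weakly not-taken
        self.history = 0                # global branch history register

    def _index(self, pc):
        return (pc ^ self.history) & self.mask  # the "share" in gshare

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2  # True = predict taken

    def update(self, pc, taken):
        idx = self._index(pc)
        if taken:
            self.table[idx] = min(3, self.table[idx] + 1)
        else:
            self.table[idx] = max(0, self.table[idx] - 1)
        # shift the outcome into the global history
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Note that prediction here is a single table lookup, which is why gshare closes timing easily on an FPGA; TAGE's multi-table tag match and provider selection is the long logic path that pTAGE pipelines.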
{"title":"An Efficient Implementation of a TAGE Branch Predictor for Soft Processors on FPGA","authors":"Katsunoshin Matsui, Md. Ashraful Islam, Kenji Kise","doi":"10.1109/MCSoC.2019.00023","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00023","url":null,"abstract":"Soft processors are becoming a common component on reconfigurable computing like FPGA. For some accelerators, custom logic functions are implemented as processing elements besides the soft processor. Since the resources in FPGA are fixed and limited, it is desired to implement the soft processor with less logical resources as possible. One of the important parts of the processor is an instruction fetch unit whose performance is dependent on branch prediction. Conventional branch predictors like bimodal or gshare are simple to implement but their prediction accuracy is not good enough. On the other hand, TAGE branch predictor has better prediction accuracy but contains complex logic path for branch prediction, which results in the lower operating frequency. In this paper, we propose a branch predictor called pTAGE, which has almost the same prediction accuracy as TAGE and avoids becoming the critical path of the processor. The branch prediction of pTAGE is pipelined, so prediction result is available on each clock cycle. We implement gshare, TAGE, and pTAGE, respectively in Verilog HDL and evaluate their operating frequency and prediction rate based on FPGA implementation. In this result, pTAGE has almost the same prediction rate as TAGE and 1.41 times higher operating frequency than that of TAGE. Also, we evaluate the performance by varying the latency for updating branch prediction, and the evaluation result shows that pTAGE exhibits higher performance in deep pipelined processors than gshare.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"6 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114606827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A convolutional neural network (CNN) is one of the most successful neural network architectures and is widely used for many embedded computer vision tasks. However, realizing a CNN requires a massive number of multiply-accumulate (MAC) operations with high power consumption, while modern tasks demand ever higher recognition accuracy. In this paper, we apply a sparseness technique to generate weak classifiers that together form an ensemble CNN. There is a trade-off between recognition accuracy and inference speed, and we control the sparse (zero-weight) ratio to obtain both excellent performance and better recognition accuracy. We use P sparse-weight CNNs with a dataflow pipeline architecture that hides the performance overhead of evaluating multiple CNNs in the ensemble, setting an adequate sparse ratio to balance the number of operation cycles in each stage. Since the ensemble depends on the dataset and its member CNNs have different layer configurations, we propose a universal convolution core that realizes variations of modern convolutional operations, and extend it to many cores with a pipelined architecture to achieve high-throughput operation. While GPUs are ill-suited to sparse convolution and achieve poor computing efficiency on it, our universal convolution cores realize an architecture with excellent pipeline efficiency. We measure the trade-off between recognition accuracy and inference speed using existing benchmark datasets and CNN models. By setting the sparsity ratio and the number of predictors appropriately, high-speed architectures are realized on the many universal cores while recognition accuracy is improved over a conventional single CNN. We implemented a prototype of the many universal convolution cores on a Xilinx Kintex UltraScale+ FPGA; compared with a desktop GPU realization of the ensemble, the proposed many-core accelerator for the ensemble sparse CNN is 3.09 times faster, consumes 4.20 times less power, and is 13.33 times better in performance per watt. Thus, by realizing the proposed ensemble method with many universal convolution cores, high-speed inference is achieved while recognition accuracy is improved compared with a conventional dense-weight CNN on a desktop GPU.
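The arithmetic saving that motivates the sparse cores can be seen in a toy 1-D convolution that stores only the nonzero weights as (offset, value) pairs, so MAC work scales with the nonzero count rather than the filter length. This is a deliberately minimal sketch, not the accelerator's datapath:

```python
def sparse_conv1d(x, w):
    """Valid-mode 1-D convolution (correlation form) that skips zero
    weights: only the nonzero (offset, value) pairs contribute MACs."""
    nz = [(k, wk) for k, wk in enumerate(w) if wk != 0.0]  # compressed filter
    out_len = len(x) - len(w) + 1
    return [sum(wk * x[i + k] for k, wk in nz) for i in range(out_len)]
```

With a 75% zero-weight ratio, each output element costs one quarter of the dense MACs; on hardware with dedicated per-core schedules this maps to fewer cycles per stage, whereas a GPU's wide SIMD lanes mostly idle through the zeros.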
{"title":"Many Universal Convolution Cores for Ensemble Sparse Convolutional Neural Networks","authors":"Ryosuke Kuramochi, Youki Sada, Masayuki Shimoda, Shimpei Sato, Hiroki Nakahara","doi":"10.1109/MCSoC.2019.00021","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00021","url":null,"abstract":"A convolutional neural network~(CNN) is one of the most successfully used neural networks and it is widely used for many embedded computer vision tasks. However, it requires a massive number of multiplication and accumulation (MAC) computations with high-power consumption to realize it, and higher recognition accuracy is desired for modern tasks. In the paper, we apply a sparseness technique to generate a weak classifier to build an ensemble CNN. There is a trade-off between recognition accuracy and inference speed, and we control sparse (zero weight) ratio to make an excellent performance and better recognition accuracy. We use P sparse weight CNNs with a dataflow pipeline architecture that hides the performance overhead for multiple CNN evaluation on the ensemble CNN. We set an adequate sparse ratio to adjust the number of operation cycles in each stage. The proposed ensemble CNN depends on the dataset quality and it has different layer configurations. We propose a universal convolution core to realize variations of modern convolutional operations, and extend it to many cores with pipelining architecture to achieve high-throughput operation. Therefore, while computing efficiency is poor on GPUs which is unsuitable for a sparseness convolution, on our universal convolution cores can realize an architecture with excellent pipeline efficiency. We measure the trade-off between recognition accuracy and inference speed using existing benchmark datasets and CNN models. By setting the sparsity ratio and the number of predictors appropriately, high-speed architectures are realized on the many universal covers while the recognition accuracy is improved compared to the conventional single CNN realization. We implemented the prototype of many universal convolution cores on the Xilinx Kintex UltraScale+ FPGA, and compared with the desktop GPU realization of the ensembling, the proposed many core based accelerator for the ensemble sparse CNN is 3.09 times faster, 4.20 times lower power, and 13.33 times better as for the performance per power. Therefore, by realizing the proposed ensemble method with many of universal convolution cores, a high-speed inference could be achieved while improving the recognition accuracy compared with the conventional dense weight CNN on the desktop GPU.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115681538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-10-01 | DOI: 10.1109/MCSoC.2019.00015
Koki Honda, Kaijie Wei, H. Amano
This paper presents an implementation of lane line detection using an FPGA and Python. Lane line detection consists of three functions: median blur, adaptive threshold, and the Hough transform. Only the accumulation stage of the Hough transform was implemented on the FPGA. Although a directly implemented Hough transform cannot fit on a low-end FPGA board, reducing the ρθ space made the implementation possible. The rest of the Hough transform was implemented using Python's NumPy, SciPy, and OpenCV. Although this part was very easy to write, it did not become a bottleneck for the whole process. As a result, we achieved a 3.9x speedup compared to OpenCV while keeping the development cost down. Additionally implementing median blur and adaptive threshold on the FPGA achieved a 6.34x speedup.
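The accumulation stage that was moved to the FPGA can be sketched in a few lines. The deliberately coarse ρθ grid below illustrates the space-reduction trick; the grid sizes and the `rho_max` bound are illustrative assumptions, not the paper's parameters:

```python
import math

def hough_accumulate(points, n_theta=16, n_rho=16, rho_max=64.0):
    """Vote edge points into a coarse rho-theta accumulator.
    Shrinking n_theta and n_rho shrinks the accumulator memory,
    which is what lets this stage fit on a low-end FPGA."""
    acc = [[0] * n_rho for _ in range(n_theta)]
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            # map rho in [-rho_max, rho_max) onto n_rho bins
            r = int((rho + rho_max) * n_rho / (2 * rho_max))
            if 0 <= r < n_rho:
                acc[t][r] += 1
    return acc
```

Peak bins in `acc` correspond to candidate lines; peak extraction and the remaining image processing stay on the host in Python, mirroring the paper's FPGA/Python split.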
{"title":"FPGA/Python Co-Design for Lane Line Detection on a PYNQ-Z1 Board","authors":"Koki Honda, Kaijie Wei, H. Amano","doi":"10.1109/MCSoC.2019.00015","DOIUrl":"https://doi.org/10.1109/MCSoC.2019.00015","url":null,"abstract":"This paper presents the implementation of lane line detection on FPGA and Python. Lane line detection consists of three functions, median blur, adaptive threshold, and Hough transform. We implemented only accumulation of Hough transform on FPGA. Although the Hough transform cannot be implemented on a low-end FPGA board if implemented directly, by reducing ρθ space, it was successfully implemented on a low-end FPGA board. The rest of the Hough transform was implemented using Python's NumPy and SciPy, and OpenCV. Although it was very easy to write, it did not become a bottleneck for the whole process because of its effectiveness. As a result, we could achieve a 3.9x speedup compared to OpenCV and kept the developing cost down. When implementing median blur and adaptive threshold on an FPGA, we could achieve a 6.34x speedup.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116175906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}