
Latest Publications: Workshop Proceedings of the 51st International Conference on Parallel Processing

Towards a GPU accelerated selective sparsity multilayer perceptron algorithm using K-Nearest Neighbors search
B. H. Meyer, Wagner M. Nunan Zola
The use of artificial neural networks and deep learning is common in many areas of knowledge. In many situations, it is necessary to use neural networks with many neurons. For example, Extreme Classification problems can use neural networks that process more than 500,000 classes and inputs with more than 100,000 dimensions, which can make training unfeasible due to the high computational cost required. To overcome this limitation, several techniques were proposed in past works, such as the SLIDE algorithm, whose implementation is based on the construction of hash tables and on CPU parallelism. This work proposes SLIDE-GPU, which replaces the hash tables with GPU algorithms for approximate nearest neighbors (ANN) search. In addition, SLIDE-GPU uses the GPU to accelerate the activation step of the neural network. In our experiments, training accelerated by up to 268% in execution time at comparable inference accuracy, even though the backpropagation phase currently remains on the CPU; this suggests that further acceleration can be obtained in future work by applying massive parallelism to the entire process. The ANN-based technique provides better inference accuracy at each epoch, which, together with GPU neuron activation, produces the global acceleration. On the activation step alone, the GPU implementation ran 28.09 times faster than the CPU implementation.
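The core idea — activating only the neurons most relevant to the current input — can be sketched in a few lines. The helper below uses an exact top-k search for clarity; SLIDE-GPU would replace this with a GPU-based approximate nearest-neighbors search, and all names here are illustrative, not from the paper:

```python
def sparse_layer_forward(x, weights, k):
    """Activate only the k neurons whose weight vectors score highest
    against the input (exact top-k here; an ANN index would replace it)."""
    scores = [(sum(wi * xi for wi, xi in zip(w, x)), j)
              for j, w in enumerate(weights)]
    scores.sort(reverse=True)
    # ReLU on the selected neurons only; all other outputs stay at zero.
    out = [0.0] * len(weights)
    for s, j in scores[:k]:
        out[j] = max(0.0, s)
    return out

x = [1.0, 0.5, -0.2]
weights = [[0.2, 0.1, 0.0], [1.0, 1.0, 1.0], [-0.5, 0.3, 0.9]]
y = sparse_layer_forward(x, weights, k=1)   # only one neuron fires
```

With large output layers (500,000+ classes), skipping the inactive neurons is where the savings come from.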
DOI: https://doi.org/10.1145/3547276.3548634 · Published: 2022-08-29
Citations: 0
Efficient Support of the Scan Vector Model for RISC-V Vector Extension
Hung-Ming Lai, Jenq-Kuen Lee
RISC-V vector extension (RVV) provides wide vector registers, making it well suited to workloads with high data-level parallelism such as machine learning or cloud computing. However, it is not easy for developers to fully utilize the underlying performance of a new architecture. Hence, abstractions such as primitives or software frameworks can be employed to ease this burden. Scan, also known as all-prefix-sums, is a common building block for many parallel algorithms. Blelloch presented an algorithmic model called the scan vector model, which uses scan operations as primitives, and demonstrated that a broad range of applications and algorithms can be implemented with them. In our work, we present efficient support for the scan vector model on RVV. With this support, parallel algorithms can be developed on top of those primitives without knowing the details of RVV while gaining the performance that RVV provides. In addition, we provide an optimization scheme based on the length multiplier feature of RVV, which further improves the utilization of the vector register files. Experiments show that our scan and segmented scan support for RVV achieves 2.85x and 4.29x speedup, respectively, compared to a sequential implementation. With further optimization using the length multiplier of RVV, we improve these results to 21.93x and 15.09x speedup.
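As a rough illustration of the primitives involved, here is a sequential reference for scan (all-prefix-sums) and segmented scan; the RVV implementation would vectorize these loops, and the function names are ours, not the paper's:

```python
def exclusive_scan(a, identity=0):
    """All-prefix-sums: out[i] = a[0] + ... + a[i-1], out[0] = identity."""
    out, acc = [], identity
    for v in a:
        out.append(acc)
        acc += v
    return out

def segmented_scan(a, flags, identity=0):
    """Like exclusive_scan, but restart wherever flags[i] == 1
    (i.e. at the head of each segment)."""
    out, acc = [], identity
    for v, f in zip(a, flags):
        if f:
            acc = identity
        out.append(acc)
        acc += v
    return out

scan_out = exclusive_scan([3, 1, 7, 0, 4])
seg_out = segmented_scan([1, 2, 3, 4], [1, 0, 1, 0])
```

Segmented scan is what lets a single vectorized pass process many independent sub-problems at once, which is central to Blelloch's model.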
DOI: https://doi.org/10.1145/3547276.3548518 · Published: 2022-08-29
Citations: 3
A Hybrid Data-flow Visual Programing Language*
Hongxin Wang, Qiuming Luo, Zheng Du
In this paper, we introduce a Hybrid Data-flow Visual Programing Language (HDVPL), an extended C/C++ language with a visual frontend and a dataflow runtime library. Although most popular dataflow visual programming languages are designed for specialized purposes, HDVPL targets general-purpose programming. Unlike the others, the behavior of an HDVPL dataflow node can be customized by the programmer. Our intuitive visual interface makes it easy to build a general-purpose dataflow program: it provides a visual editor to create nodes and connect them into a DAG of dataflow tasks, enabling beginners in computer programming to build parallel programs easily. With the subgraph feature, complex hierarchical graphs can be built using container nodes. Once the whole program is complete, HDVPL translates it into text-based source code and compiles it into an object file, which is linked with the HDVPL dataflow runtime library. To visualize dataflow programs at runtime, we integrated our dataflow library with the frontend visual editor; the visual frontend shows detailed information about the running program in a console window.
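The firing rule of a dataflow node — execute once all input tokens have arrived, then forward the result along outgoing edges — can be sketched as follows. This is a hypothetical minimal model, not HDVPL's generated code:

```python
class Node:
    """A dataflow node fires once all of its inputs have arrived."""
    def __init__(self, fn, n_inputs):
        self.fn, self.n_inputs = fn, n_inputs
        self.inbox, self.targets = [], []

    def connect(self, other):
        self.targets.append(other)

    def receive(self, value):
        self.inbox.append(value)
        if len(self.inbox) == self.n_inputs:   # all tokens present: fire
            self.last = self.fn(*self.inbox)
            for t in self.targets:             # push result downstream
                t.receive(self.last)

# DAG: two source tokens feed an adder, whose result feeds a doubler.
add = Node(lambda a, b: a + b, 2)
double = Node(lambda x: 2 * x, 1)
add.connect(double)
add.receive(3)
add.receive(4)
```

In HDVPL the node bodies would be programmer-supplied C/C++ functions and the graph would come from the visual editor.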
DOI: https://doi.org/10.1145/3547276.3548525 · Published: 2022-08-29
Citations: 0
A Fast and Secure AKA Protocol for B5G
Jung-Hsien Wu, Jie Yang, Yung-Chin Chang, Min-Te Sun
With the popularity of mobile devices, mobile service requirements are now changing rapidly. This implies that a micro network operator dedicated to a specific sector of users has the potential to improve the 5G architecture in terms of scalability and autonomy. However, the traditional AKA protocol does not allow the micro operator to authenticate mobile users independently. To solve this problem, we propose the Fast AKA protocol, which disseminates a subscriber's profile among base stations via a blockchain and mutually authenticates the subscriber and the serving base station locally for roaming. The proposed architecture speeds up the authentication process, provides forward/backward secrecy, and resists replay and man-in-the-middle attacks. We believe that Fast AKA can serve as a cornerstone for B5G.
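The mutual-authentication step can be illustrated with a generic challenge-response exchange: each side proves knowledge of a shared secret without transmitting it. This sketch uses HMAC and invented names; it is not the actual Fast AKA message flow, which relies on blockchain-disseminated subscriber profiles:

```python
import hmac, hashlib, os

SHARED_KEY = b"subscriber-profile-key"   # hypothetical pre-shared secret

def prove(key, nonce):
    """Response tag: keyed hash over the challenger's fresh nonce."""
    return hmac.new(key, nonce, hashlib.sha256).digest()

# Base station challenges the subscriber...
nonce_bs = os.urandom(16)
tag_ue = prove(SHARED_KEY, nonce_bs)          # subscriber's response
ok_ue = hmac.compare_digest(tag_ue, prove(SHARED_KEY, nonce_bs))

# ...then the roles are swapped, so authentication is mutual.
nonce_ue = os.urandom(16)
tag_bs = prove(SHARED_KEY, nonce_ue)
ok_bs = hmac.compare_digest(tag_bs, prove(SHARED_KEY, nonce_ue))
```

Fresh nonces on both sides are what give such exchanges their replay resistance.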
DOI: https://doi.org/10.1145/3547276.3548440 · Published: 2022-08-29
Citations: 0
A User-Based Bike Return Algorithm for Docked Bike Sharing Systems
Donghui Chen, Kazuya Sakai
Recently, the development of Internet connectivity, intelligence, and sharing in the bicycle industry has helped bike sharing systems (BSSs) establish connections between public transport hubs. In this paper, we propose a novel user-based bike return (UBR) algorithm for docked BSSs which leverages a dynamic price adjustment mechanism so that the system can rebalance the number of lent and returned bikes by itself across nearby docks. The proposed scheme motivates users to return their bikes to other underflow docks close to their target destinations through a cheaper plan that compensates for the shortage there. Consequently, the bike sharing system achieves dynamic self-balance, the operational cost of the entire system is reduced for operators, and the satisfaction of users is significantly increased. Simulations are conducted using real traces from Citi Bike, and the results demonstrate that the proposed UBR achieves its design goals.
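A dynamic price adjustment of the kind described might look like the following sketch, where emptier docks offer larger return discounts. The formula and field names are hypothetical, not taken from the paper:

```python
def return_fee(occupancy, capacity, base_fee=1.0, max_discount=0.5):
    """The emptier a dock, the cheaper it is to return a bike there
    (illustrative linear rule; the paper's pricing may differ)."""
    fill = occupancy / capacity
    return base_fee * (1.0 - max_discount * (1.0 - fill))

def best_dock(docks, base_fee=1.0):
    """Pick the candidate dock with the cheapest return fee."""
    return min(docks,
               key=lambda d: return_fee(d["bikes"], d["cap"], base_fee))

docks = [{"name": "A", "bikes": 9, "cap": 10},   # nearly full: no discount
         {"name": "B", "bikes": 2, "cap": 10}]   # underflow: big discount
choice = best_dock(docks)
```

Steering users toward underflow docks this way is what lets the system rebalance itself without dedicated redistribution trucks.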
DOI: https://doi.org/10.1145/3547276.3548443 · Published: 2022-08-29
Citations: 0
OpenMP Offloading in the Jetson Nano Platform
Ilias K. Kasmeridis, V. Dimakopoulos
The NVIDIA Jetson Nano is a very popular system-on-module and developer kit which brings high-performance specs to a small and power-efficient embedded platform. Integrating a 128-core GPU and a quad-core CPU, it provides enough capability to support computationally demanding applications such as AI inference, deep learning, and computer vision. While the Jetson Nano family supports a number of APIs and libraries out of the box, comprehensive support for OpenMP, one of the most popular APIs, is not readily available. In this work we present the implementation of an OpenMP infrastructure that is able to harness both the CPU and the GPU of a Jetson Nano board using the offload facilities of recent versions of the OpenMP specification. We discuss the compiler-side transformations of key constructs, the generation of CUDA-based code, and how the runtime support is provided. We also provide experimental results for a number of applications, exhibiting performance comparable with their pure CUDA versions.
DOI: https://doi.org/10.1145/3547276.3548517 · Published: 2022-08-29
Citations: 0
Extracting High Definition Map Information from Aerial Images
Guan-Wen Chen, Hsueh-Yi Lai, Tsì-Uí İk
Compared with traditional digital maps, high definition maps (HD maps) collect information at the lane level instead of the road level and provide more diverse and detailed road network information, including lane markings, speed limits, rules, and intersection junctions. HD maps can supply high-precision information for driving navigation and autonomous cars, improving driving safety. However, constructing an HD map takes a lot of time, so HD maps cannot yet be widely used in applications. This paper proposes a method to identify road information from aerial traffic images using a semantic image segmentation algorithm and then convert it into OpenDRIVE, an open-source HD map standard format. In experiments, 13 categories of lane markings are identified with an mIoU of 84.3% and an mPA of 89.6%.
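The reported mIoU metric averages, over all classes, the ratio of correctly predicted pixels to the union of predicted and ground-truth pixels for that class. A minimal computation over flattened label arrays (helper names are ours):

```python
def confusion(pred, truth, n_classes):
    """confusion[t][p] counts pixels of true class t predicted as p."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for p, t in zip(pred, truth):
        m[t][p] += 1
    return m

def mean_iou(m):
    """mIoU = mean over classes of TP / (TP + FP + FN)."""
    ious = []
    for c in range(len(m)):
        tp = m[c][c]
        fn = sum(m[c]) - tp                     # missed pixels of class c
        fp = sum(row[c] for row in m) - tp      # pixels wrongly labeled c
        if tp + fp + fn:
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious)

pred  = [0, 0, 1, 1, 1, 2]
truth = [0, 0, 1, 1, 2, 2]
miou = mean_iou(confusion(pred, truth, 3))
```

mPA is computed analogously, but averages per-class pixel accuracy TP / (TP + FN) instead.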
DOI: https://doi.org/10.1145/3547276.3548442 · Published: 2022-08-29
Citations: 0
Runtime Techniques for Automatic Process Virtualization
Evan Ramos, Sam White, A. Bhosale, L. Kalé
Asynchronous many-task runtimes look promising for the next generation of high performance computing systems. But these runtimes are usually based on new programming models, requiring extensive programmer effort to port existing applications to them. An alternative approach is to reimagine the execution model of widely used programming APIs, such as MPI, in order to execute them more asynchronously. Virtualization is a powerful technique that can be used to execute a bulk synchronous parallel program in an asynchronous manner. Moreover, if the virtualized entities can be migrated between address spaces, the runtime can optimize execution with dynamic load balancing, fault tolerance, and other adaptive techniques. Previous work on automating process virtualization has explored compiler approaches, source-to-source refactoring tools, and runtime methods. These approaches achieve virtualization with different tradeoffs in terms of portability (across different architectures, operating systems, compilers, and linkers), programmer effort required, and the ability to handle all different kinds of global state and programming languages. We implement support for three different related runtime methods, discuss shortcomings and their applicability to user-level virtualized process migration, and compare performance to existing approaches. Compared to existing approaches, one of our new methods achieves what we consider the best overall functionality in terms of portability, automation, support for migration, and runtime performance.
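One benefit the abstract highlights — dynamic load balancing of migratable virtualized entities — can be sketched with a toy greedy rebalancer that maps many virtual "ranks" onto few workers. All names and costs here are hypothetical, not from the paper:

```python
def balance(loads, n_workers):
    """Greedy rebalancing: place virtual ranks, heaviest first,
    onto the currently least-loaded worker (i.e. migrate them there)."""
    workers = [0.0] * n_workers
    placement = {}
    for rank, cost in sorted(loads.items(), key=lambda kv: -kv[1]):
        w = min(range(n_workers), key=lambda i: workers[i])
        workers[w] += cost
        placement[rank] = w
    return placement, workers

# Four virtual ranks with uneven costs, over-decomposed onto two workers.
loads = {0: 5.0, 1: 1.0, 2: 4.0, 3: 2.0}
placement, per_worker = balance(loads, 2)
```

Over-decomposition (more virtual ranks than workers) is what gives the runtime room to even out the load this way.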
DOI: https://doi.org/10.1145/3547276.3548522 · Published: 2022-08-29
Citations: 0
Training reinforcement learning models via an adversarial evolutionary algorithm
M. Coletti, Chathika Gunaratne, Catherine D. Schuman, Robert M. Patton
When training for control problems, using more episodes in training usually leads to better generalizability, but more episodes also require significantly more training time. There are a variety of approaches for choosing training episodes, including fixed episodes, uniform sampling, and stochastic sampling, but they can all leave gaps in the training landscape. In this work, we describe an approach that leverages an adversarial evolutionary algorithm to identify the worst-performing states for a given model. We then use information about these states in the next cycle of training, which is repeated until the desired level of model performance is met. We demonstrate this approach with the OpenAI Gym cart-pole problem. We show that the adversarial evolutionary algorithm did not reduce the number of training episodes needed to attain model generalizability when compared with stochastic sampling, and actually performed slightly worse.
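The adversarial loop — evolve start states that maximize the model's failure, then feed them back into training — can be sketched with a toy elitist EA. The failure function below is a stand-in; the paper evaluates a cart-pole controller instead:

```python
import random

def policy_failure(state):
    """Stand-in fitness: how badly a fixed toy policy does from this
    start state (higher = worse for the model)."""
    x, v = state
    return abs(x) + abs(v)          # the toy policy struggles far from origin

def evolve_worst_states(pop_size=8, gens=20, bound=1.0, seed=0):
    """Elitist EA: keep the hardest start states, mutate them further."""
    rng = random.Random(seed)
    clip = lambda z: min(bound, max(-bound, z))
    pop = [(rng.uniform(-bound, bound), rng.uniform(-bound, bound))
           for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=policy_failure, reverse=True)   # hardest states first
        parents = pop[: pop_size // 2]
        children = [(clip(x + rng.gauss(0, 0.1)), clip(v + rng.gauss(0, 0.1)))
                    for x, v in parents]
        pop = parents + children                     # elitism: parents survive
    return max(pop, key=policy_failure)

worst = evolve_worst_states()
```

In the full method, the states found this way seed the next cycle of RL training rather than just being reported.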
DOI: https://doi.org/10.1145/3547276.3548635 · Published: 2022-08-29
Citations: 0
Pipelined Compression in Remote GPU Virtualization Systems using rCUDA: Early Experiences
Cristian Peñaranda Cebrián, C. Reaño, F. Silla
The number of Internet of Things (IoT) devices has been increasing in recent years. These are usually low-performance devices with slow network connections. A common improvement is therefore to perform some computations at the edge of the network (e.g., preprocessing data), thereby reducing the amount of data sent through the network. To enhance the computing capabilities of edge devices, remote virtual Graphics Processing Units (GPUs) can be used. Thus, edge devices can leverage GPUs installed in remote computers. However, this solution requires exchanging data with the remote GPU across the network, which, as mentioned, is typically slow. In this paper we present a novel approach to improving the communication performance of edge devices using the rCUDA remote GPU virtualization framework. We implement within this framework on-the-fly pipelined data compression, which is done transparently to applications. We use four popular machine learning samples to carry out an initial performance exploration. The analysis is done using a slow 10 Mbps network to emulate the conditions of these devices. Early results show potential improvements provided some current issues are addressed.
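rCUDA's actual implementation is not shown in the abstract; as an illustration of the general idea of pipelined compression, here is a hedged Python sketch in which compression of one chunk overlaps the "transfer" and decompression of the previous one via a bounded queue. The chunk size, queue depth, and use of `zlib` are assumptions for the example only.

```python
import queue
import threading
import zlib

CHUNK = 64 * 1024  # illustrative chunk size

def sender(data, q):
    """Compress fixed-size chunks and hand them to the 'network' queue,
    so compression of chunk i+1 overlaps transfer of chunk i."""
    for i in range(0, len(data), CHUNK):
        q.put(zlib.compress(data[i:i + CHUNK]))
    q.put(None)  # end-of-stream marker

def receiver(q, out):
    """Drain the queue, decompressing chunks as they arrive."""
    while True:
        item = q.get()
        if item is None:
            break
        out.append(zlib.decompress(item))

payload = b"edge device telemetry " * 100_000
q = queue.Queue(maxsize=4)  # bounded queue provides back-pressure
chunks = []
rx = threading.Thread(target=receiver, args=(q, chunks))
rx.start()
sender(payload, q)
rx.join()
result = b"".join(chunks)
```

The bounded queue is the pipelining mechanism: on a slow 10 Mbps link, the sender can keep compressing while earlier chunks are still in flight, rather than compressing the whole buffer before any byte is sent.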
DOI: 10.1145/3547276.3548628 · Published 2022-08-29 · Workshop Proceedings of the 51st International Conference on Parallel Processing
Citations: 0
Journal
Workshop Proceedings of the 51st International Conference on Parallel Processing