Combined scientific CFD simulation and interactive raytracing with OpenCL

Moritz Lehmann
{"title":"Combined scientific CFD simulation and interactive raytracing with OpenCL","authors":"Moritz Lehmann","doi":"10.1145/3529538.3529542","DOIUrl":null,"url":null,"abstract":"One of the main uses for OpenCL is (scientific) compute applications where graphical rendering is done externally, after the simulation has finished. However separating simulation and rendering has many disadvantages, especially the extreme slowdown caused by copying simulation data from device to host, and needing to store raw data on the hard drive, taking up hundreds of gigabyte, just to visualize preliminary results. A much faster approach is to implement both simulation and rendering in OpenCL. The rendering kernels have direct read-only access to the raw simulation data that resides in ultra-fast GPU memory. This eliminates all PCIe data transfer but camera parameters and finished frames, allowing for interactive visualization of simulation results in real time while the simulation is running. This is an invaluable tool for rapid prototyping. Although OpenCL does not have existing functionality for graphical rendering, being a general compute language, it allows for implementing an entire graphics engine, such that no data has to be moved to the CPU during rendering. On top, specific low-level optimizations make this OpenCL graphics engine outperform any existing rendering solution for this scenario, enabling drawing billions of lines per second and fluid raytracing in real time on even non-RTX GPUs. This combination of simulation and rendering in OpenCL is demonstrated with the software FluidX3D [3] - a lattice Boltzmann method (LBM) fluid dynamics solver. The first part will briefly introduce the numerical method for simulating fluid flow in a physically accurate manner. After introducing the LBM, the optimizations to make it run at peak efficiency are discussed: Being a memory-bound algorithm, coalesced memory access is key. This is achieved through array-of-structures data layout as well as the one-step-pull scheme, a certain variant of the LBM streaming step. One-step-pull leverages the fact that the misaligned read penalty is much smaller than the misaligned write penalty on almost all GPUs. Roofline analysis shows that with these optimizations, the LBM runs at 100% efficiency on the fastest data-center and gaming GPUs [5]. To simulate free surface flows, the LBM is extended with the Volume-of-Fluid (VoF) model. An efficient algorithm has been designed to vastly accelerate the challenging surface tension computation [4]. This extremely efficient VoF-LBM GPU implementation allows covering new grounds in science: FluidX3D has been used to simulate more than 1600 raindrop impacts to statistically evaluate how microplastics transition from the ocean surface into the atmosphere when the spray droplets are generated during drop impact [6]. At the same power consumption, with existing CPU-parallelized codes, compute time would have been several years, whilst with FluidX3D it was about a week. The second part will focus on real time rendering with OpenCL, especially raytracing. Rasterization on the GPU is parallelized not over pixels but lines/triangles instead, making runtime mostly independent of screen resolution and lightning fast. Each line/triangle is transformed with the camera parameters from 3D to 2D screen coordinates and then rasterized onto the frame (integer array) with Bresenham algorithm [2] and z-buffer. 
To simulate free-surface flows, the LBM is extended with the Volume-of-Fluid (VoF) model. An efficient algorithm has been designed to vastly accelerate the challenging surface tension computation [4]. This extremely efficient VoF-LBM GPU implementation allows covering new ground in science: FluidX3D has been used to simulate more than 1600 raindrop impacts to statistically evaluate how microplastics transition from the ocean surface into the atmosphere via the spray droplets generated during drop impact [6]. At the same power consumption, compute time with existing CPU-parallelized codes would have been several years, whilst with FluidX3D it was about a week.

The second part focuses on real-time rendering with OpenCL, especially raytracing. Rasterization on the GPU is parallelized not over pixels but over lines/triangles instead, making runtime mostly independent of screen resolution and lightning fast. Each line/triangle is transformed with the camera parameters from 3D to 2D screen coordinates and then rasterized onto the frame (an integer array) with the Bresenham algorithm [2] and a z-buffer, as in the sketch below.
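A minimal sketch of such a line rasterizer, assuming one work-item per 3D line, a row-major 4x4 view-projection matrix, and a z-buffer that packs 24-bit depth and an 8-bit color index into one uint so that a single atomic_min() performs both depth test and write. All of these choices are illustrative, not the engine's actual format.

```c
// Project a 3D point with a row-major 4x4 view-projection matrix to
// (pixel x, pixel y, normalized depth). For brevity this sketch assumes
// both endpoints are in front of the camera (no near-plane clipping).
float3 project(const float3 p, const float16 m, const uint w, const uint h) {
    const float4 q = (float4)(
        m.s0*p.x + m.s1*p.y + m.s2*p.z + m.s3,
        m.s4*p.x + m.s5*p.y + m.s6*p.z + m.s7,
        m.s8*p.x + m.s9*p.y + m.sa*p.z + m.sb,
        m.sc*p.x + m.sd*p.y + m.se*p.z + m.sf);
    return (float3)((0.5f+0.5f*q.x/q.w)*(float)w, (0.5f+0.5f*q.y/q.w)*(float)h,
                    clamp(q.z/q.w, 0.0f, 1.0f));
}

__kernel void draw_lines(__global const float8* lines, // x0,y0,z0,_,x1,y1,z1,_
                         volatile __global uint* zbuffer, // init to 0xFFFFFFFFu
                         const float16 cam, const uint w, const uint h,
                         const uint color) {              // 8-bit color index
    const float8 l = lines[get_global_id(0)]; // one work-item per line
    const float3 a = project(l.s012, cam, w, h);
    const float3 b = project(l.s456, cam, w, h);
    int x = (int)a.x, y = (int)a.y;
    const int x1 = (int)b.x, y1 = (int)b.y;
    const int dx = x1 > x ? x1-x : x-x1, sx = x < x1 ? 1 : -1;
    const int dy = y1 > y ? y-y1 : y1-y, sy = y < y1 ? 1 : -1; // dy <= 0
    const int steps = max(dx, -dy);
    int err = dx + dy, i = 0;
    while (true) { // Bresenham [2]: integer error accumulation, no multiplies
        if (x >= 0 && y >= 0 && x < (int)w && y < (int)h) {
            const float z = mix(a.z, b.z, steps ? (float)i/(float)steps : 0.0f);
            const uint frag = ((uint)(z*16777215.0f) << 8) | (color & 0xFFu);
            atomic_min(&zbuffer[(uint)y*w + (uint)x], frag); // depth test + write
        }
        if (x == x1 && y == y1) break;
        const int e2 = 2*err;
        if (e2 >= dy) { err += dy; x += sx; }
        if (e2 <= dx) { err += dx; y += sy; }
        i++;
    }
}
```

Because the work is distributed over lines rather than pixels, the runtime scales with the number of primitives, not with the screen resolution.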
The raytracing graphics are based on a combination of fast ray-grid traversal and marching-cubes, leveraging the fact that the computational grid from the LBM already is an ideal acceleration structure for raytracing. The idea of raytracing is simple: through each pixel on the screen, shoot a reverse light ray out of the camera and see where it intersects with a surface in the scene. Then (recursively) calculate reflected/refracted rays and mix the colors. If a ray does not intersect with anything, its color is determined by the skybox image via UV mapping and bilinear pixel interpolation. With mesh surfaces consisting of many triangles, computation time quickly becomes a problem, as every ray would have to be tested against all triangles for intersection. To overcome this, an acceleration structure is required. While computer games often use a bounding volume hierarchy, the LBM already provides an ideal alternative: the simulation grid. The corresponding algorithm is called ray-grid traversal: when a ray shoots through the 3D grid, intersections with the surface only have to be checked at each traversed grid cell rather than across the entire grid. In each traversed grid cell, the 0 to 5 surface triangles are generated on the fly with the marching-cubes algorithm, and ray-triangle intersections are checked with the Möller-Trumbore algorithm. Only once an intersection has been found are the normals calculated at the 8 grid points spanning the cell and trilinearly interpolated to the intersection coordinates. The interpolated surface normal makes the raytraced surface appear perfectly smooth.
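Below is a minimal sketch of such a ray-grid traversal, a 3D DDA in the spirit of Amanatides and Woo, assuming the ray origin is given in cell units and the direction has no zero components. is_surface_cell() is a hypothetical stand-in for the per-cell marching-cubes triangle generation and Möller-Trumbore test.

```c
// Minimal ray-grid traversal (3D DDA) sketch, illustrative only. The ray
// visits cells in order; the surface test only runs for visited cells.
bool is_surface_cell(const uint3 cell) { return false; } // hypothetical stub

bool traverse_grid(const float3 origin, const float3 dir, // dir normalized,
                   const uint3 size, uint3* hit_cell) {   // no zero components
    int3 c = convert_int3(floor(origin)); // cell containing the ray origin
    const int3 step = (int3)(dir.x < 0.0f ? -1 : 1, dir.y < 0.0f ? -1 : 1,
                             dir.z < 0.0f ? -1 : 1);  // march direction per axis
    const float3 inv = (float3)(1.0f, 1.0f, 1.0f) / dir;
    const float3 tdelta = fabs(inv); // t between two cell boundaries per axis
    // t at which the ray crosses the next cell boundary on each axis
    const float3 next = convert_float3(c) +
        (float3)(step.x > 0 ? 1.0f : 0.0f, step.y > 0 ? 1.0f : 0.0f,
                 step.z > 0 ? 1.0f : 0.0f);
    float3 tmax = (next - origin) * inv;
    while (c.x >= 0 && c.y >= 0 && c.z >= 0 && c.x < (int)size.x &&
           c.y < (int)size.y && c.z < (int)size.z) {
        if (is_surface_cell(convert_uint3(c))) {
            *hit_cell = convert_uint3(c); // first surface cell along the ray
            return true;
        }
        // advance to whichever cell boundary the ray crosses first
        if (tmax.x < tmax.y && tmax.x < tmax.z) { c.x += step.x; tmax.x += tdelta.x; }
        else if (tmax.y < tmax.z)               { c.y += step.y; tmax.y += tdelta.y; }
        else                                    { c.z += step.z; tmax.z += tdelta.z; }
    }
    return false; // ray left the grid: color from skybox via UV mapping
}
```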
On the GPU, the rays for all pixels on screen are computed in parallel, vastly speeding up rendering. How the OpenCL workgroups are aligned on the 2D array of screen pixels is of key importance: best performance is achieved with 8x8 pixel tiles, about 50% faster than 64x1 tiles, because with small, square-ish tiles all rays of a workgroup are more likely to traverse the same grid cells, greatly improving memory broadcasting. In ray-grid traversal, the 8 isovalues spanning a cell have to be loaded from GPU memory for each traversed cell. Once the triangle intersection has been found, the gradient at each of the 8 cell corners is calculated with central differences. Instead of loading 6 additional isovalues for each of the 8 grid points (48 extra loads), the cell's own isovalues are reused such that only 24 additional isovalues are loaded.
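The following sketch shows this smooth-shading step: central-difference gradients at the 8 cell corners, trilinearly interpolated to the intersection point. For brevity it reloads isovalues through a hypothetical clamped-load helper phi() instead of implementing the 24-load reuse described above; both helper names are assumptions.

```c
// Clamped isovalue load from a flat grid of dimensions s (hypothetical helper).
float phi(__global const float* iso, const uint3 s, const int3 p) {
    const int3 q = clamp(p, (int3)(0), convert_int3(s) - 1);
    return iso[((uint)q.z * s.y + (uint)q.y) * s.x + (uint)q.x];
}

// Smooth surface normal at intersection point pos (in cell units):
// central-difference gradients at the 8 corners, trilinearly interpolated.
float3 normal_at(__global const float* iso, const uint3 s, const float3 pos) {
    const int3 c   = convert_int3(floor(pos));
    const float3 f = pos - floor(pos); // fractional position inside the cell
    float3 g[2][2][2];
    for (int k = 0; k < 2; k++) for (int j = 0; j < 2; j++) for (int i = 0; i < 2; i++) {
        const int3 p = c + (int3)(i, j, k);
        g[k][j][i] = (float3)( // central differences of the isovalue field
            phi(iso, s, p + (int3)(1,0,0)) - phi(iso, s, p - (int3)(1,0,0)),
            phi(iso, s, p + (int3)(0,1,0)) - phi(iso, s, p - (int3)(0,1,0)),
            phi(iso, s, p + (int3)(0,0,1)) - phi(iso, s, p - (int3)(0,0,1)));
    }
    // trilinear interpolation of the 8 corner gradients
    const float3 gx00 = mix(g[0][0][0], g[0][0][1], f.x);
    const float3 gx10 = mix(g[0][1][0], g[0][1][1], f.x);
    const float3 gx01 = mix(g[1][0][0], g[1][0][1], f.x);
    const float3 gx11 = mix(g[1][1][0], g[1][1][1], f.x);
    const float3 gy0  = mix(gx00, gx10, f.y);
    const float3 gy1  = mix(gx01, gx11, f.y);
    return normalize(mix(gy0, gy1, f.z)); // smooth shading normal
}
```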

For marching-cubes, the algorithm by Paul Bourke [1] is implemented in OpenCL. With 16-/8-bit integers, bit-packing and symmetry, the tables are reduced to 1/8 of their original size and stored in constant memory space. For computing the cube index, branching is eliminated by bit operations, as sketched below. The Möller-Trumbore algorithm [7] is likewise implemented in an entirely branchless manner, as in the second sketch.
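A minimal sketch of the branchless cube-index computation: instead of eight if-statements, the sign bit of (corner isovalue minus threshold) is shifted into bit i of the index. The bit order must of course match the triangle tables used; the names here are illustrative.

```c
// Branchless marching-cubes case index from the 8 corner isovalues v[]
// and the threshold iso: corners below the threshold set their bit.
uint cube_index(const float v[8], const float iso) {
    uint index = 0u;
    for (uint i = 0u; i < 8u; i++) {
        // as_uint() exposes the IEEE sign bit: 1 exactly when v[i] < iso
        index |= (as_uint(v[i] - iso) >> 31) << i;
    }
    return index;
}
```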

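And a minimal sketch of a branchless Möller-Trumbore test [7]: all hit conditions are folded into one integer mask so the final ternary compiles to a select rather than a divergent branch. The epsilon and the miss sentinel of -1.0f are assumptions for this sketch.

```c
// Branchless Moeller-Trumbore ray-triangle intersection [7].
// Returns the ray parameter t of the hit, or -1.0f on a miss.
float ray_triangle(const float3 o, const float3 d,  // ray origin, direction
                   const float3 v0, const float3 v1, const float3 v2) {
    const float3 e1 = v1 - v0, e2 = v2 - v0;        // triangle edges
    const float3 p  = cross(d, e2);
    const float det = dot(e1, p);      // ~0 means ray is parallel to triangle
    const float inv = 1.0f / det;      // may be inf; masked out below
    const float3 t  = o - v0;
    const float u   = dot(t, p) * inv; // first barycentric coordinate
    const float3 q  = cross(t, e1);
    const float v   = dot(d, q) * inv; // second barycentric coordinate
    const float dist = dot(e2, q) * inv; // ray parameter of the hit
    const int hit = (fabs(det) > 1e-7f) & (u >= 0.0f) & (v >= 0.0f)
                  & (u + v <= 1.0f) & (dist > 0.0f); // all conditions, no branch
    return hit ? dist : -1.0f; // compiles to a select, not a branch
}
```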
This raytracing implementation is fast enough to run in real time even for the largest lattice dimensions that fit into the memory of a GPU. Finally, the combined VoF-LBM simulation and raytracing implementation is demonstrated on the most realistic simulation of an impacting raindrop ever done [8].