The ability to generate visual representations of data, and the ability to enhance data into a suitable form for the purpose of visual representation, form two key components in a scientific visualization system. By a visual representation we mean the ability to render the data, using visual cues, such that the important features are readily perceived by the user. By the ability to enhance data we mean the ability to apply transformations to the data so that salient features embedded in the data become discernible and quantifiable. The rendering of data (computer graphics) and the enhancement of data (image processing) have emerged over the last twenty years as separate scientific disciplines. However, in scientific visualization and other applications of empirical data interpretation, we are increasingly confronted with the need to combine both data rendering and data transformation capabilities under one system framework. This paper describes the design issues and implementation of a program for visualizing and enhancing volume data on distributed memory architectures. Our design is motivated by the desire to interactively view, transform, and interpret volume data acquired using seismic imaging techniques. Experimental results derived from an implementation on the Connection Machine CM-5 are described.
{"title":"Integrating volume data analysis and rendering on distributed memory architectures","authors":"E. Camahort, I. Chakravarty","doi":"10.1109/PRS.1993.586092","DOIUrl":"https://doi.org/10.1109/PRS.1993.586092","journal":{"name":"Proceedings of 1993 IEEE Parallel Rendering Symposium"},"publicationDate":"1993-11-01"}
This work examines the network performance of mesh-connected multicomputers applied to parallel volume rendering algorithms. This issue has not been addressed in papers describing particular parallel implementations, but is pertinent to anyone designing or implementing parallel rendering algorithms. Parallel volume rendering algorithms fall into two main classes: image partitions and object partitions. Communication requirements for algorithms in these classes are analyzed. Network performance for these algorithms is estimated by using an existing model of mesh network behavior. The performance estimates are verified by tests on the Touchstone Delta. The results indicate that, for a fixed screen size, the performance of 2D mesh networks scales very well when used with object partition algorithms: the time required for communication actually decreases as the data and system sizes increase. A Touchstone Delta implementation of an object partition algorithm is briefly described to illustrate the algorithm's low communication requirements.
{"title":"Parallel volume-rendering algorithm performance on mesh-connected multicomputers","authors":"U. Neumann","doi":"10.1145/166181.166196","DOIUrl":"https://doi.org/10.1145/166181.166196","journal":{"name":"Proceedings of 1993 IEEE Parallel Rendering Symposium"},"publicationDate":"1993-11-01"}
This paper presents a new multicomputer polygon rendering algorithm that is specialized for interactive applications. The algorithm differs from previous algorithms in two ways. First, it load balances the rasterization once per frame, instead of as the frame progresses, using the previous frame's distribution of polygons on the screen as input to the load-balancing algorithm. Second, it uses a new message sending scheme that reduces the number of messages required. These characteristics mean that the algorithm only requires global synchronization between frames, which allows for higher frame rates. The algorithm was selected using a simulator, which confirmed that using the previous frame's polygon distribution on the screen is nearly as good as using the current frame's distribution. The algorithm is implemented on Caltech's Intel Touchstone Delta, a 512-processor multicomputer system, and preliminary performance figures are given. The highest performance achieved to date is 930,000 triangles per second using 256 processors and an 806,640-triangle data set.
{"title":"A multicomputer polygon rendering algorithm for interactive applications","authors":"D. Ellsworth","doi":"10.1145/166181.166187","DOIUrl":"https://doi.org/10.1145/166181.166187","journal":{"name":"Proceedings of 1993 IEEE Parallel Rendering Symposium"},"publicationDate":"1993-11-01"}
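The once-per-frame load balancing described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the screen is taken to be divided into numbered regions, `prev_counts` is a hypothetical map from region to the number of polygons that fell in it during the previous frame, and the greedy heaviest-first assignment is a generic heuristic, not necessarily the paper's load-balancing algorithm.

```python
import heapq

def assign_regions(prev_counts, num_procs):
    """Assign screen regions to processors using last frame's polygon counts.

    Generic greedy heuristic (heaviest region first, least-loaded processor
    next); the abstract does not specify the paper's actual algorithm.
    """
    # Min-heap of (estimated load, processor id).
    heap = [(0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    assignment = {}
    for region in sorted(prev_counts, key=prev_counts.get, reverse=True):
        load, proc = heapq.heappop(heap)
        assignment[region] = proc
        heapq.heappush(heap, (load + prev_counts[region], proc))
    return assignment

def processor_loads(assignment, counts, num_procs):
    """Total polygon count per processor under a given assignment."""
    totals = [0] * num_procs
    for region, proc in assignment.items():
        totals[proc] += counts[region]
    return totals

# Hypothetical counts from the previous frame, keyed by region id.
prev_counts = {0: 50, 1: 30, 2: 20, 3: 10, 4: 10}
assignment = assign_regions(prev_counts, num_procs=2)
print(processor_loads(assignment, prev_counts, 2))
```

Because the assignment is computed once from the previous frame's counts, no rebalancing messages are needed while the current frame is being rasterized, which is the property the abstract ties to synchronizing only between frames.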
In the purely object-parallel approach to multiprocessor rendering, each processor is assigned responsibility to render a subset of the graphics database. When rendering is complete, pixels from the processors must be merged and globally z-buffered. On an arbitrary multiprocessor interconnection network, the straightforward algorithm for pixel merging requires $\bar{d}A$ total network bandwidth per frame, where $\bar{d}$ is the depth complexity of the scene and $A$ is the area of the screen or window. This algorithm is used by the Kubota Pacific Denali and appears to be used by the Evans and Sutherland Freedom series. An alternative algorithm, the PixelFlow algorithm, requires $nA$ network bandwidth per frame, where $n$ is the number of processors. But the merging is pipelined in PixelFlow so that each network link need only support $A$ bandwidth per frame. However, that algorithm requires a separate special-purpose network for pixel merging. In this paper we present and analyze an expected-case $\log(\bar{d})A$ algorithm for pixel merging that uses network broadcast, and we discuss the algorithm's applicability to shared-memory bus architectures.
{"title":"Pixel merging for object-parallel rendering: A distributed snooping algorithm","authors":"M. Cox, P. Hanrahan","doi":"10.1145/166181.166188","DOIUrl":"https://doi.org/10.1145/166181.166188","journal":{"name":"Proceedings of 1993 IEEE Parallel Rendering Symposium"},"publicationDate":"1993-11-01"}
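The merge step that the abstract above takes as its starting point, combining per-processor pixels under a global z-buffer, can be sketched as follows. This illustrates only the basic depth comparison, not the distributed snooping algorithm itself. The representation of a partial frame buffer as a list of `(depth, color)` pairs, and the names `z_merge` and `merge_all`, are assumptions made for the example.

```python
INF = float("inf")  # depth used for pixels a processor did not cover

def z_merge(a, b):
    """Merge two partial frame buffers, keeping the nearer sample per pixel.

    Each buffer is a list of (depth, color) pairs, one per pixel.
    """
    return [pa if pa[0] <= pb[0] else pb for pa, pb in zip(a, b)]

def merge_all(buffers):
    """Fold z_merge over the partial buffers produced by all processors."""
    result = buffers[0]
    for buf in buffers[1:]:
        result = z_merge(result, buf)
    return result

# Three hypothetical processors, each covering part of a 3-pixel image.
p0 = [(1.0, "r"), (INF, None), (5.0, "r")]
p1 = [(2.0, "g"), (3.0, "g"), (INF, None)]
p2 = [(INF, None), (4.0, "b"), (0.5, "b")]
print(merge_all([p0, p1, p2]))
```

The bandwidth figures in the abstract concern how many of these per-pixel samples must cross the network before the comparison can be made; the sketch leaves that transport out entirely.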
We present a parallel rendering algorithm targeted to MIMD distributed-memory message-passing architectures. For maximum performance, the algorithm exploits both object-level and image-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. The results show that the choice of message size has a significant impact on performance. Scalability to large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 confirms the analytical results and demonstrates increasing performance from 1 to 128 processors across a wide range of scene complexities.
{"title":"A MIMD rendering algorithm for distributed memory architectures","authors":"T. Crockett, T. Orloff","doi":"10.1145/166181.166186","DOIUrl":"https://doi.org/10.1145/166181.166186","journal":{"name":"Proceedings of 1993 IEEE Parallel Rendering Symposium"},"publicationDate":"1993-11-01"}
The progressive refinement method is investigated for parallelization on ring-connected multicomputers. A synchronous scheme, based on static task assignment, is proposed in order to achieve better coherence during the parallel light distribution computations. An efficient global circulation scheme is proposed for the parallel light distribution computations, which reduces the total volume of concurrent communication by an asymptotic factor. The proposed parallel algorithm is implemented on a ring-embedded Intel iPSC/2 hypercube multicomputer. The load-balance quality of the proposed static assignment schemes is evaluated experimentally. The effect of coherence in the parallel light distribution computations on the shooting patch selection sequence is also investigated.
{"title":"Progressive refinement radiosity on ring-connected multicomputers","authors":"T. Çapin, C. Aykanat, B. Özgüç","doi":"10.1145/166181.166192","DOIUrl":"https://doi.org/10.1145/166181.166192","journal":{"name":"Proceedings of 1993 IEEE Parallel Rendering Symposium"},"publicationDate":"1993-11-01"}
This paper presents a graphics renderer which incorporates new partitioning methodologies of memory and work for efficient execution on a parallel computer. The task adaptive domain decomposition scheme is an image space method involving dynamic partitioning of rectangular pixel area tasks. We show that this method requires little overhead, allows coherence within a parallel context, handles worst case scenarios with reasonable speedup, executes efficiently, and requires minimal processor synchronization. The implementation analysis indicates that load imbalance is the major cause of performance degradation at the higher processor counts. Even so, on a variety of test scenes, an average rendering speedup of 79 was achieved utilizing 96 processors on the BBN TC2000 multiprocessor with processor efficiency ranging from 66% to 94%.
{"title":"A task adaptive parallel graphics renderer","authors":"S. Whitman","doi":"10.1145/166181.166185","DOIUrl":"https://doi.org/10.1145/166181.166185","journal":{"name":"Proceedings of 1993 IEEE Parallel Rendering Symposium"},"publicationDate":"1993-11-01"}
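As a quick consistency check on the figures quoted in the abstract above, and assuming the usual definition of parallel efficiency as speedup divided by processor count (the abstract does not state the definition it uses), the reported average speedup corresponds to:

```latex
E = \frac{S}{p} = \frac{79}{96} \approx 0.82
```

That is, an average efficiency of roughly 82%, which falls inside the 66% to 94% range the abstract reports across test scenes.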
The ray-tracing algorithm produces high quality images by taking multiple luminous effects into account. Hence, it requires many computations and a large memory capacity. Parallel machines offer a way to significantly reduce the synthesis time. Distributed Memory Parallel Computers offer an interesting performance/cost ratio but require both computations and data to be distributed. This paper is a study of the implementation of the ray-tracing algorithm on a Distributed Memory Parallel Computer. An original solution, which combines a data-parallel approach with a task-parallel one, is presented. A dynamic load redistribution mechanism allows us to ensure a good load balance during the synthesis phase. At the end of the paper, some results of our transputer implementation are presented.
{"title":"An efficient parallel ray tracing scheme for distributed memory parallel computers","authors":"W. Lefer","doi":"10.1145/166181.166193","DOIUrl":"https://doi.org/10.1145/166181.166193","journal":{"name":"Proceedings of 1993 IEEE Parallel Rendering Symposium"},"publicationDate":"1993-11-01"}
A scalable approach to parallel volume raycasting of structured and unstructured computational grids is presented. The algorithm is general enough to handle non-convex grids and cells, grids with voids, grids constructed from multiple grids, and embedded geometrical primitives. The algorithm is designed for a highly parallel MIMD architecture which features both local memory and shared memory with nonuniform access times. It has been implemented on a BBN TC2000 and benchmarked on several datasets. A variation of the algorithm which provides fast image updates for a changing transfer function is also presented. A distributed approach to controlling the execution of the volume renderer is used, and the graphical user interface designed for this purpose is briefly described.
{"title":"Scalable parallel volume raycasting for nonrectilinear computational grids","authors":"J. Challinger","doi":"10.1145/166181.166194","DOIUrl":"https://doi.org/10.1145/166181.166194","journal":{"name":"Proceedings of 1993 IEEE Parallel Rendering Symposium"},"publicationDate":"1993-11-01"}
This paper presents a divide-and-conquer ray-traced volume rendering algorithm and a parallel image compositing method, along with their implementation and performance on the Connection Machine CM-5 and networked workstations. This algorithm distributes both the data and the computations to individual processing units to achieve fast, high-quality rendering of high-resolution data. The volume data, once distributed, is left intact. The processing nodes perform local ray tracing of their subvolumes concurrently. No communication between processing units is needed during this local ray-tracing process. A subimage is generated by each processing unit and the final image is obtained by compositing subimages in the proper order, which can be determined a priori. Test results on the CM-5 and a group of networked workstations demonstrate the practicality of our rendering algorithm and compositing method.
{"title":"A data distributed, parallel algorithm for ray-traced volume rendering","authors":"K. Ma, J. Painter, C. Hansen, M. Krogh","doi":"10.1145/166181.166183","DOIUrl":"https://doi.org/10.1145/166181.166183","journal":{"name":"Proceedings of 1993 IEEE Parallel Rendering Symposium"},"publicationDate":"1993-11-01"}
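The final step described in the abstract above, compositing per-node subimages in a known depth order, can be sketched as follows. The sketch assumes the standard "over" operator on premultiplied `(color, alpha)` samples and a simple sequential front-to-back loop. The abstract specifies neither the operator nor the parallel schedule, so the names `over` and `composite` and the single-channel pixel format are illustrative only.

```python
def over(front, back):
    """Composite a premultiplied (color, alpha) sample `front` over `back`."""
    cf, af = front
    cb, ab = back
    return (cf + (1.0 - af) * cb, af + (1.0 - af) * ab)

def composite(subimages_front_to_back):
    """Combine per-node subimages, given nearest-first, into one image.

    Each subimage is a list of (color, alpha) pixels of equal length.
    """
    # Start from a fully transparent image.
    result = [(0.0, 0.0)] * len(subimages_front_to_back[0])
    for sub in subimages_front_to_back:
        result = [over(acc, px) for acc, px in zip(result, sub)]
    return result

# Two hypothetical one-pixel subimages: a half-transparent front sample
# over an opaque back sample.
print(composite([[(0.5, 0.5)], [(1.0, 1.0)]]))
```

Because "over" is associative, subimages can be combined pairwise in any grouping as long as the front-to-back order is respected, which is what makes an a priori ordering of subvolumes useful for parallel compositing.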