Pub Date : 2011-11-01DOI: 10.1109/DASIP.2011.6136904
Antoine Pedron, L. Lacassagne, F. Bimbard, S. Berre
The CIVA software platform developed by CEA-LIST offers various simulation and data processing modules dedicated to non-destructive testing (NDT). In particular, ultrasonic imaging and reconstruction tools are proposed, in the purpose of localizing echoes and identifying and sizing the detected defects. Because of the complexity of data processed, computation time is now a limitation for the optimal use of available information. In this article, we present performance results on parallelization of one computationally heavy algorithm on general purpose processors (GPP) and graphic processing units (GPU). GPU implementation makes an intensive use of atomic intrinsics. Compared to initial GPP implementation, optimized GPP implementation runs up to ×116 faster and GPU implementation up to ×631. This shows that, even with irregular workloads, combining software optimization and hardware improvements, GPU give high performance.
{"title":"Parallelization of an ultrasound reconstruction algorithm for non destructive testing on multicore CPU and GPU","authors":"Antoine Pedron, L. Lacassagne, F. Bimbard, S. Berre","doi":"10.1109/DASIP.2011.6136904","DOIUrl":"https://doi.org/10.1109/DASIP.2011.6136904","url":null,"abstract":"The CIVA software platform developed by CEA-LIST offers various simulation and data processing modules dedicated to non-destructive testing (NDT). In particular, ultrasonic imaging and reconstruction tools are proposed, in the purpose of localizing echoes and identifying and sizing the detected defects. Because of the complexity of data processed, computation time is now a limitation for the optimal use of available information. In this article, we present performance results on parallelization of one computationally heavy algorithm on general purpose processors (GPP) and graphic processing units (GPU). GPU implementation makes an intensive use of atomic intrinsics. Compared to initial GPP implementation, optimized GPP implementation runs up to ×116 faster and GPU implementation up to ×631. This shows that, even with irregular workloads, combining software optimization and hardware improvements, GPU give high performance.","PeriodicalId":199500,"journal":{"name":"Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126766162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-11-01DOI: 10.1109/DASIP.2011.6136853
B. Ouni, C. Belleudy, S. Bilavarn, E. Senn
In this paper, a flow of characterization of embedded operating system's energy consumption is presented. The objective is to determine the energy overhead of the services of the embedded OS, we interest particularly on the context switch service. The modeling is based on measurements on the hardware platform OMAP35x EVM board, running Linux omap. Based on the analysis results, a relationship between energy overhead and a set of hardware and software parameters is established.
{"title":"Embedded operating systems energy overhead","authors":"B. Ouni, C. Belleudy, S. Bilavarn, E. Senn","doi":"10.1109/DASIP.2011.6136853","DOIUrl":"https://doi.org/10.1109/DASIP.2011.6136853","url":null,"abstract":"In this paper, a flow of characterization of embedded operating system's energy consumption is presented. The objective is to determine the energy overhead of the services of the embedded OS, we interest particularly on the context switch service. The modeling is based on measurements on the hardware platform OMAP35x EVM board, running Linux omap. Based on the analysis results, a relationship between energy overhead and a set of hardware and software parameters is established.","PeriodicalId":199500,"journal":{"name":"Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)","volume":"257 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115794997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-11-01DOI: 10.1109/DASIP.2011.6136878
A. Rahman, Hossam Amer, A. Prihozhy, Christophe Lucarz, M. Mattavelli
Signal processing designs are becoming increasingly complex with demands for more advanced algorithms. Designers are now seeking high-level tools and methodology to help manage complexity and increase productivity. Recently, CAL dataflow language has been specified which is capable of synthesizing dataflow description into RTL codes for hardware implementation, and based on several case studies, have shown promising results. However, no work has been done on global network analysis, which could increase the optimization space. In this paper, we introduce methodologies to analyze and optimize CAL programs by determining which actions should be parallelized, pipelined, or refactored for the highest throughput gain, and then providing tools and techniques to achieve this using minimum resource. As a case study on the RVC MPEG-4 SP Intra decoder for implementation on Virtex-5 FPGA, experimental results confirmed our analysis with throughput gain of up to 3.5x using relatively-minor additional slice compared to the reference design.
{"title":"Optimization methodologies for complex FPGA-based signal processing systems with CAL","authors":"A. Rahman, Hossam Amer, A. Prihozhy, Christophe Lucarz, M. Mattavelli","doi":"10.1109/DASIP.2011.6136878","DOIUrl":"https://doi.org/10.1109/DASIP.2011.6136878","url":null,"abstract":"Signal processing designs are becoming increasingly complex with demands for more advanced algorithms. Designers are now seeking high-level tools and methodology to help manage complexity and increase productivity. Recently, CAL dataflow language has been specified which is capable of synthesizing dataflow description into RTL codes for hardware implementation, and based on several case studies, have shown promising results. However, no work has been done on global network analysis, which could increase the optimization space. In this paper, we introduce methodologies to analyze and optimize CAL programs by determining which actions should be parallelized, pipelined, or refactored for the highest throughput gain, and then providing tools and techniques to achieve this using minimum resource. As a case study on the RVC MPEG-4 SP Intra decoder for implementation on Virtex-5 FPGA, experimental results confirmed our analysis with throughput gain of up to 3.5x using relatively-minor additional slice compared to the reference design.","PeriodicalId":199500,"journal":{"name":"Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132068331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-11-01DOI: 10.1109/DASIP.2011.6136866
A. Darji, R. Bansal, S. Merchant, A. Chandorkar
The lifting scheme reduces the computational complexity for computing Discrete Wavelet Transform (DWT) compared to convolution. We have proposed a high performance and memory efficient architecture with parallel scanning method for 2-D DWT using 5/3 Lifting wavelet. This 2-D architecture is composed with two 1-D DWT units and a Transpose Unit (TU). Proposed parallel scanning reduces requirement of on-chip line buffer compared to other line based scanning. Proposed 2-D DWT architecture utilizes only 2N size buffer for NxN sized image, which is low compare to 3.5N usual requirement for to implement 5/3 Lifting wavelet. This is achieved by performing column and row transform simultaneously. Designed 1-D DWT module can process two inputs at a time and produce two outputs per clock which reduces latency significantly compared to other 2-D dual scan based DWT architectures. Designed TU operates at half clock rate which reduces power and its design is independent of size of input image. Instead of shifter we propose Hardwired Scaling Unit (HSU) for coefficient multiplication. Unlike shift register unit this design saves clocks and helps in reducing power by great amount. This architecture is synthesized using Xilinx ISE 10.1 and is implemented on Virtex-IIPRO XC2VP30 FPGA. Very low FPGA resource utilization is found.
{"title":"High speed VLSI architecture for 2-D lifting Discrete Wavelet Transform","authors":"A. Darji, R. Bansal, S. Merchant, A. Chandorkar","doi":"10.1109/DASIP.2011.6136866","DOIUrl":"https://doi.org/10.1109/DASIP.2011.6136866","url":null,"abstract":"The lifting scheme reduces the computational complexity for computing Discrete Wavelet Transform (DWT) compared to convolution. We have proposed a high performance and memory efficient architecture with parallel scanning method for 2-D DWT using 5/3 Lifting wavelet. This 2-D architecture is composed with two 1-D DWT units and a Transpose Unit (TU). Proposed parallel scanning reduces requirement of on-chip line buffer compared to other line based scanning. Proposed 2-D DWT architecture utilizes only 2N size buffer for NxN sized image, which is low compare to 3.5N usual requirement for to implement 5/3 Lifting wavelet. This is achieved by performing column and row transform simultaneously. Designed 1-D DWT module can process two inputs at a time and produce two outputs per clock which reduces latency significantly compared to other 2-D dual scan based DWT architectures. Designed TU operates at half clock rate which reduces power and its design is independent of size of input image. Instead of shifter we propose Hardwired Scaling Unit (HSU) for coefficient multiplication. Unlike shift register unit this design saves clocks and helps in reducing power by great amount. This architecture is synthesized using Xilinx ISE 10.1 and is implemented on Virtex-IIPRO XC2VP30 FPGA. Very low FPGA resource utilization is found.","PeriodicalId":199500,"journal":{"name":"Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123872570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-11-01DOI: 10.1109/DASIP.2011.6136868
Chenglong Xiao, E. Casseau
In recent years, the use of extensible processors has been increased. Extensible processors extend the base instruction set of a general-purpose processor with a set of custom instructions. Custom instructions that can be implemented in special hardware units make it possible to improve performance and decrease power consumption in extensible processors. The key issue involved is to generate and select automatically the custom instructions from a high-level application code. However, enumerating all possible custom instructions of a given dataflow graph is a computationally difficult problem. In this paper, we propose an efficient algorithm for the exact enumeration of maximal convex custom instructions. The state of the art algorithms use either a bottom-up manner or a top-down manner to solve the problem. The proposed algorithm enumerates all maximal convex custom instructions by using a sandwich manner that combines the advantage of the bottom-up manner and the top-down manner. Compared to the latest algorithm, our algorithm can achieve orders of magnitude speedup.
{"title":"Efficient maximal convex custom instruction enumeration for extensible processors","authors":"Chenglong Xiao, E. Casseau","doi":"10.1109/DASIP.2011.6136868","DOIUrl":"https://doi.org/10.1109/DASIP.2011.6136868","url":null,"abstract":"In recent years, the use of extensible processors has been increased. Extensible processors extend the base instruction set of a general-purpose processor with a set of custom instructions. Custom instructions that can be implemented in special hardware units make it possible to improve performance and decrease power consumption in extensible processors. The key issue involved is to generate and select automatically the custom instructions from a high-level application code. However, enumerating all possible custom instructions of a given dataflow graph is a computationally difficult problem. In this paper, we propose an efficient algorithm for the exact enumeration of maximal convex custom instructions. The state of the art algorithms use either a bottom-up manner or a top-down manner to solve the problem. The proposed algorithm enumerates all maximal convex custom instructions by using a sandwich manner that combines the advantage of the bottom-up manner and the top-down manner. Compared to the latest algorithm, our algorithm can achieve orders of magnitude speedup.","PeriodicalId":199500,"journal":{"name":"Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130114980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-11-01DOI: 10.1109/DASIP.2011.6136859
Sylvain Huet, Vincent Boulos, V. Fristot, L. Salvo
Nowadays, it is possible to build a multi-GPU supercomputer, well suited for implementation of digital signal processing algorithms, for a few thousand dollars. However, to achieve the highest performance with this kind of architecture, the programmer has to focus on inter-processor communications, tasks synchronization … In this paper, we propose a design flow allowing an efficient implementation of a Digital Signal Processing (DSP) application specified as a Data Flow Graph (DFG) on a multi GPU computer cluster. We focus particularly on the effective implementation of communications by automating the computation-communication overlap, which can lead to significant speedups as shown in the presented benchmark. The approach is validated on a 3D granulometry application developed for research on materials.
{"title":"DFG implementation on multi GPU cluster with computation-communication overlap","authors":"Sylvain Huet, Vincent Boulos, V. Fristot, L. Salvo","doi":"10.1109/DASIP.2011.6136859","DOIUrl":"https://doi.org/10.1109/DASIP.2011.6136859","url":null,"abstract":"Nowadays, it is possible to build a multi-GPU supercomputer, well suited for implementation of digital signal processing algorithms, for a few thousand dollars. However, to achieve the highest performance with this kind of architecture, the programmer has to focus on inter-processor communications, tasks synchronization … In this paper, we propose a design flow allowing an efficient implementation of a Digital Signal Processing (DSP) application specified as a Data Flow Graph (DFG) on a multi GPU computer cluster. We focus particularly on the effective implementation of communications by automating the computation-communication overlap, which can lead to significant speedups as shown in the presented benchmark. The approach is validated on a 3D granulometry application developed for research on materials.","PeriodicalId":199500,"journal":{"name":"Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117304455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-11-01DOI: 10.1109/DASIP.2011.6136860
Youngsub Ko, Youngmin Yi, S. Ha
H.264/AVC video encoders have been widely used for its high coding efficiency. Since the computational demand proportional to the frame resolution is constantly increasing, it has been of great interest to accelerate H.264/AVC by parallel processing. Recently, graphics processing units (GPUs) have emerged as a viable target for accelerating general purpose applications by exploiting fine-grain data parallelisms. Despite extensive research effort to use GPUs to accelerate the H.264/AVC algorithm, it has not been successful to achieve any speed-up over the x264 algorithm that is known as the fastest CPU implementation because of significant communication overhead between the host CPU and the GPU and intra-frame dependency in the algorithm. In this paper, we propose a novel motion estimation (ME) algorithm tailored for NVIDIA GPU implementation. It is accompanied by a novel pipelining technique, called sub-frame ME processing, to effectively hide the communication overhead between the host CPU and the GPU. The proposed H.264 encoder achieves more than 20% speed-up compared with x264.
{"title":"An efficient parallel motion estimation algorithm and X264 parallelization in CUDA","authors":"Youngsub Ko, Youngmin Yi, S. Ha","doi":"10.1109/DASIP.2011.6136860","DOIUrl":"https://doi.org/10.1109/DASIP.2011.6136860","url":null,"abstract":"H.264/AVC video encoders have been widely used for its high coding efficiency. Since the computational demand proportional to the frame resolution is constantly increasing, it has been of great interest to accelerate H.264/AVC by parallel processing. Recently, graphics processing units (GPUs) have emerged as a viable target for accelerating general purpose applications by exploiting fine-grain data parallelisms. Despite extensive research effort to use GPUs to accelerate the H.264/AVC algorithm, it has not been successful to achieve any speed-up over the x264 algorithm that is known as the fastest CPU implementation because of significant communication overhead between the host CPU and the GPU and intra-frame dependency in the algorithm. In this paper, we propose a novel motion estimation (ME) algorithm tailored for NVIDIA GPU implementation. It is accompanied by a novel pipelining technique, called sub-frame ME processing, to effectively hide the communication overhead between the host CPU and the GPU. The proposed H.264 encoder achieves more than 20% speed-up compared with x264.","PeriodicalId":199500,"journal":{"name":"Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)","volume":"21 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120905395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-11-01DOI: 10.1109/DASIP.2011.6136862
E. Cannella, O. Derin, T. Stefanov
We investigate and propose a number of different middleware approaches, namely virtual connector, virtual connector with variable rate, and request-driven, which implement the semantics of Kahn Process Networks on Network-on-Chip architectures. All of the presented solutions allow for run-time system adaptivity. We implement the approaches on a Network-on-Chip multiprocessor platform prototyped on an FPGA. Their comparison in terms of the introduced overhead is presented on two case studies with different communication characteristics. We found out that the virtual connector mechanism outperforms other approaches in the communication-intensive application. In the other case study, which has a higher computation/communication ratio, the middleware approaches show similar performance.
{"title":"Middleware approaches for adaptivity of Kahn Process Networks on Networks-on-Chip","authors":"E. Cannella, O. Derin, T. Stefanov","doi":"10.1109/DASIP.2011.6136862","DOIUrl":"https://doi.org/10.1109/DASIP.2011.6136862","url":null,"abstract":"We investigate and propose a number of different middleware approaches, namely virtual connector, virtual connector with variable rate, and request-driven, which implement the semantics of Kahn Process Networks on Network-on-Chip architectures. All of the presented solutions allow for run-time system adaptivity. We implement the approaches on a Network-on-Chip multiprocessor platform prototyped on an FPGA. Their comparison in terms of the introduced overhead is presented on two case studies with different communication characteristics. We found out that the virtual connector mechanism outperforms other approaches in the communication-intensive application. In the other case study, which has a higher computation/communication ratio, the middleware approaches show similar performance.","PeriodicalId":199500,"journal":{"name":"Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127277425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-11-01DOI: 10.1109/DASIP.2011.6136847
J. Peeters, N. Ventroux, Tanguy Sassolas, L. Lacassagne
Increasingly complex systems need parallelized simulation engines. In the context of SystemC simulation, existing proposals require predicting communication in the simulated system. However, this is often unpredictable. In order to deal with unpredictable systems, this paper presents a parallelization approach using asynchronous communication without modification of the SystemC simulation engine. Simulated system model is cut up and distributed across separate simulation engines, each part being evaluated in parallel of others. Functional consistency is preserved thanks to the simulated system write exclusive memory access policy while temporal consistency is guaranteed using explicit synchronization. Experimental results show up a speed-up up to 13x on 16 processors.
{"title":"A systemc TLM framework for distributed simulation of complex systems with unpredictable communication","authors":"J. Peeters, N. Ventroux, Tanguy Sassolas, L. Lacassagne","doi":"10.1109/DASIP.2011.6136847","DOIUrl":"https://doi.org/10.1109/DASIP.2011.6136847","url":null,"abstract":"Increasingly complex systems need parallelized simulation engines. In the context of SystemC simulation, existing proposals require predicting communication in the simulated system. However, this is often unpredictable. In order to deal with unpredictable systems, this paper presents a parallelization approach using asynchronous communication without modification of the SystemC simulation engine. Simulated system model is cut up and distributed across separate simulation engines, each part being evaluated in parallel of others. Functional consistency is preserved thanks to the simulated system write exclusive memory access policy while temporal consistency is guaranteed using explicit synchronization. Experimental results show up a speed-up up to 13x on 16 processors.","PeriodicalId":199500,"journal":{"name":"Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121444752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-11-01DOI: 10.1109/DASIP.2011.6136902
Y. Blanchard, A. Dupret, A. Peizerat
Development of smart CMOS imagers is a complex design task where the verification of an architecture composed of a matrix of pixels intermixed with analog and digital electronics is playing an important part. New generations of imager using 3D integration will allow even more processing to be done in-situ. Verification has to be done locally for the pixel and globally for the architecture. Design exploration and validation problematic has shifted from mostly the analog domain to the validation of a complex SOC with millions of parallel processors, the pixels. In this paper we present a methodology using the SystemC language for the creation of fast models for validation and a first level evaluation of performance of large CMOS imager architectures.
{"title":"Systemc modelization for fast validation of imager architectures","authors":"Y. Blanchard, A. Dupret, A. Peizerat","doi":"10.1109/DASIP.2011.6136902","DOIUrl":"https://doi.org/10.1109/DASIP.2011.6136902","url":null,"abstract":"Development of smart CMOS imagers is a complex design task where the verification of an architecture composed of a matrix of pixels intermixed with analog and digital electronics is playing an important part. New generations of imager using 3D integration will allow even more processing to be done in-situ. Verification has to be done locally for the pixel and globally for the architecture. Design exploration and validation problematic has shifted from mostly the analog domain to the validation of a complex SOC with millions of parallel processors, the pixels. In this paper we present a methodology using the SystemC language for the creation of fast models for validation and a first level evaluation of performance of large CMOS imager architectures.","PeriodicalId":199500,"journal":{"name":"Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132356107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}