Software-Based Lightweight Multithreading to Overlap Memory-Access Latencies of Commodity Processors
Cihang Jiang, Youhui Zhang, Weimin Zheng
2015 44th International Conference on Parallel Processing, 2015-09-01
DOI: 10.1109/ICPP.2015.71 (https://doi.org/10.1109/ICPP.2015.71)
Emerging service applications operate on vast datasets that are kept in DRAM to minimize latency and improve throughput. A considerable portion of them make irregular memory references, which causes serious locality problems. This paper presents SLIM, a Software-based LIght-weight Multithreading framework that addresses this problem on commodity hardware while keeping the simple style of multithreaded programming. The principle is straightforward: when issuing an irregular memory reference, the current fine-grained thread uses an asynchronous memory-access primitive and then switches itself out so that other threads can execute, overlapping the long memory latency. Meanwhile, SLIM tries to keep most of the thread contexts in the on-chip cache to reduce cache misses. The main challenge therefore lies in improving cache behavior despite the extra instructions needed for context switches and the smaller cache space left for applications. To address it, we propose a performance model to guide the design and verify the model with tests. We also design an optimized synchronization mechanism. For several classic irregular applications, extensive tests explore how system configurations affect performance, including the aggressiveness of data prefetching and the distribution of tasks among cores and CPUs. Results show that SLIM achieves higher performance than a counterpart using traditional threads across different data scales. Even compared with hand-optimized code, its performance is comparable, and it still preserves the simple programming model of high-concurrency applications.
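
The switch-on-access principle described in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not SLIM's implementation: Python generators stand in for the lightweight threads, `yield` stands in for "issue the asynchronous access and switch out", and the names `chase`, `run`, `MEM_LATENCY`, and `SWITCH_COST` as well as the cycle counts are hypothetical values chosen for illustration.

```python
from collections import deque

MEM_LATENCY = 100  # assumed cycles for one irregular DRAM access (illustrative)
SWITCH_COST = 10   # assumed cycles for one lightweight context switch (illustrative)


def chase(table, start, hops):
    """Lightweight thread: follow `hops` links in `table`, one irregular access per hop."""
    idx = start
    for _ in range(hops):
        yield idx         # stands in for: issue the asynchronous access, then switch out
        idx = table[idx]  # resumed only after the data is assumed to have arrived
    return idx


def run(threads):
    """Round-robin scheduler over generators; returns (results, simulated cycles)."""
    clock = 0
    results = [None] * len(threads)
    # Each entry: (cycle at which the pending access completes, thread id, generator)
    queue = deque((0, tid, gen) for tid, gen in enumerate(threads))
    while queue:
        ready_at, tid, gen = queue.popleft()
        # The core stalls only if the data of the next thread has not arrived yet.
        clock = max(clock, ready_at) + SWITCH_COST
        try:
            next(gen)  # run the thread until its next memory access
            queue.append((clock + MEM_LATENCY, tid, gen))
        except StopIteration as done:
            results[tid] = done.value  # thread finished; record its result
    return results, clock
```

In this model a single thread stalls for the full latency on every access, whereas several interleaved threads wait on their accesses at the same time, so the simulated cycle count for the whole batch drops. The per-switch cost is the overhead side of the trade-off that the abstract mentions; the cache-space effect is not modelled here.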