Image bitmaps, i.e., pixel data that carry visual information, are widely used in emerging applications for pixel operations, while consuming large amounts of memory space and energy. Compared with legacy DRAM (dynamic random access memory), non-volatile memories (NVMs) are suitable for bitmap storage due to their high density and intrinsic durability. However, NVM writes suffer from higher energy consumption and latency than read accesses. Existing precise or approximate compression schemes in NVM controllers show limited performance for bitmaps due to the irregular data patterns and high variance in bitmaps. We observe pixel-level similarity when writing bitmaps, owing to the analogous contents of adjacent pixels. By exploiting this pixel-level similarity, we propose SimCom, an approximate similarity-aware compression scheme in the NVM module controller, to efficiently compress data for each write access on-the-fly. The idea behind SimCom is to compress runs of consecutive similar words into pairs of a base word and a run length. The storage costs for small runs are further mitigated by reusing the least significant bits of base words. SimCom adaptively selects an appropriate compression mode for various bitmap formats, thus achieving an efficient trade-off between quality and memory performance. We implement SimCom on gem5/zsim with NVMain and evaluate its performance with real-world image/video workloads. Our results demonstrate the efficacy and efficiency of SimCom in achieving a favorable quality-performance trade-off.
{"title":"Approximate Similarity-Aware Compression for Non-Volatile Main Memory","authors":"Zhang-Yu Chen, Yu Hua, Peng-Fei Zuo, Yuan-Yuan Sun, Yun-Cheng Guo","doi":"10.1007/s11390-023-2565-7","DOIUrl":"https://doi.org/10.1007/s11390-023-2565-7","url":null,"abstract":"<p>Image bitmaps, i.e., data containing pixels and visual perception, have been widely used in emerging applications for pixel operations while consuming lots of memory space and energy. Compared with legacy DRAM (dynamic random access memory), non-volatile memories (NVMs) are suitable for bitmap storage due to the salient features of high density and intrinsic durability. However, writing NVMs suffers from higher energy consumption and latency compared with read accesses. Existing precise or approximate compression schemes in NVM controllers show limited performance for bitmaps due to the irregular data patterns and variance in bitmaps. We observe the pixel-level similarity when writing bitmaps due to the analogous contents in adjacent pixels. By exploiting the pixel-level similarity, we propose SimCom, an approximate similarity-aware compression scheme in the NVM module controller, to efficiently compress data for each write access on-the-fly. The idea behind SimCom is to compress continuous similar words into the pairs of base words with runs. The storage costs for small runs are further mitigated by reusing the least significant bits of base words. SimCom adaptively selects an appropriate compression mode for various bitmap formats, thus achieving an efficient trade-off between quality and memory performance. We implement SimCom on GEM5/zsim with NVMain and evaluate the performance with real-world image/video workloads. Our results demonstrate the efficacy and efficiency of our SimCom with an efficient quality-performance trade-off.</p>","PeriodicalId":50222,"journal":{"name":"Journal of Computer Science and Technology","volume":"100 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140602216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Network embedding, as an approach to learning low-dimensional representations of nodes, has proved extremely useful in many applications, e.g., node classification and link prediction. Unfortunately, existing network embedding models are vulnerable to random or adversarial perturbations, which may degrade the performance of network embedding when it is applied to downstream tasks. To achieve robust network embedding, researchers introduce adversarial training to regularize the embedding learning process by training on a mixture of adversarial examples and original examples. However, existing methods generate adversarial examples heuristically, failing to guarantee the imperceptibility of the generated adversarial examples and thus limiting the power of adversarial training. In this paper, we propose a novel method, Identity-Preserving Adversarial Training (IPAT), for network embedding, which generates imperceptible adversarial examples with explicit identity-preserving regularization. We formalize this identity-preserving regularization as a multi-class classification problem where each node represents a class, and we encourage each adversarial example to be classified as the class of its original node. Extensive experimental results on real-world datasets demonstrate that our proposed IPAT method significantly improves the robustness of network embedding models and the generalization of the learned node representations on various downstream tasks.
{"title":"Identity-Preserving Adversarial Training for Robust Network Embedding","authors":"Ke-Ting Cen, Hua-Wei Shen, Qi Cao, Bing-Bing Xu, Xue-Qi Cheng","doi":"10.1007/s11390-023-2256-4","DOIUrl":"https://doi.org/10.1007/s11390-023-2256-4","url":null,"abstract":"<p>Network embedding, as an approach to learning low-dimensional representations of nodes, has been proved extremely useful in many applications, e.g., node classification and link prediction. Unfortunately, existing network embedding models are vulnerable to random or adversarial perturbations, which may degrade the performance of network embedding when being applied to downstream tasks. To achieve robust network embedding, researchers introduce adversarial training to regularize the embedding learning process by training on a mixture of adversarial examples and original examples. However, existing methods generate adversarial examples heuristically, failing to guarantee the imperceptibility of generated adversarial examples, and thus limit the power of adversarial training. In this paper, we propose a novel method Identity-Preserving Adversarial Training (IPAT) for network embedding, which generates imperceptible adversarial examples with explicit identity-preserving regularization. We formalize such identity-preserving regularization as a multi-class classification problem where each node represents a class, and we encourage each adversarial example to be discriminated as the class of its original node. Extensive experimental results on real-world datasets demonstrate that our proposed IPAT method significantly improves the robustness of network embedding models and the generalization of the learned node representations on various downstream tasks.</p>","PeriodicalId":50222,"journal":{"name":"Journal of Computer Science and Technology","volume":"140 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140581919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Federated Dynamic Client Selection for Fairness Guarantee in Heterogeneous Edge Computing
Pub Date: 2024-01-30, DOI: 10.1007/s11390-023-2972-9
Ying-Chi Mao, Li-Juan Shen, Jun Wu, Ping Ping, Jie Wu
Federated learning has emerged as a distributed learning paradigm that trains at each client and aggregates at a parameter server. Under system heterogeneity, stragglers cannot respond to the server in time and incur huge communication costs. Although client grouping in federated learning can solve the straggler problem, the stochastic selection strategy used in client grouping neglects the impact of data distribution within each group. Besides, current client grouping approaches subject clients to unfair participation, leading to biased performance across clients. To guarantee the fairness of client participation and mitigate biased local performance, we propose a federated dynamic client selection method based on data representativity (FedSDR). FedSDR clusters clients into groups according to their local computational efficiency. To estimate the significance of client datasets, we design a novel data representativity evaluation scheme based on local data distribution. Furthermore, the two most representative clients in each group are selected to optimize the global model. Finally, the DYNAMIC-SELECT algorithm updates local computational efficiency and data representativity states to regroup clients after periodic average aggregation. Evaluations on real datasets show that FedSDR improves client participation by 27.4%, 37.9%, and 23.3% compared with FedAvg, TiFL, and FedSS, respectively, taking fairness into account in federated learning. In addition, FedSDR surpasses FedAvg, FedGS, and FedMS by 21.32%, 20.4%, and 6.90%, respectively, in local test accuracy variance, balancing the performance bias of the global model across clients.
{"title":"Federated Dynamic Client Selection for Fairness Guarantee in Heterogeneous Edge Computing","authors":"Ying-Chi Mao, Li-Juan Shen, Jun Wu, Ping Ping, Jie Wu","doi":"10.1007/s11390-023-2972-9","DOIUrl":"https://doi.org/10.1007/s11390-023-2972-9","url":null,"abstract":"<p>Federated learning has emerged as a distributed learning paradigm by training at each client and aggregating at a parameter server. System heterogeneity hinders stragglers from responding to the server in time with huge communication costs. Although client grouping in federated learning can solve the straggler problem, the stochastic selection strategy in client grouping neglects the impact of data distribution within each group. Besides, current client grouping approaches make clients suffer unfair participation, leading to biased performances for different clients. In order to guarantee the fairness of client participation and mitigate biased local performances, we propose a federated dynamic client selection method based on data representativity (FedSDR). FedSDR clusters clients into groups correlated with their own local computational efficiency. To estimate the significance of client datasets, we design a novel data representativity evaluation scheme based on local data distribution. Furthermore, the two most representative clients in each group are selected to optimize the global model. Finally, the DYNAMIC-SELECT algorithm updates local computational efficiency and data representativity states to regroup clients after periodic average aggregation. Evaluations on real datasets show that FedSDR improves client participation by 27.4%, 37.9%, and 23.3% compared with FedAvg, TiFL, and FedSS, respectively, taking fairness into account in federated learning. In addition, FedSDR surpasses FedAvg, FedGS, and FedMS by 21.32%, 20.4%, and 6.90%, respectively, in local test accuracy variance, balancing the performance bias of the global model across clients.</p>","PeriodicalId":50222,"journal":{"name":"Journal of Computer Science and Technology","volume":"13 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140581938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Minimal Context-Switching Data Race Detection with Dataflow Tracking
Pub Date: 2024-01-30, DOI: 10.1007/s11390-023-1569-7
Long Zheng, Yang Li, Jie Xin, Hai-Feng Liu, Ran Zheng, Xiao-Fei Liao, Hai Jin
Data races are among the most important concurrency anomalies in multi-threaded programs. Emerging constraint-based techniques have been leveraged for race detection, and they are able to find all the races that can be found by any other sound race detector. However, this constraint-based approach has serious limitations in helping programmers analyze and understand data races. First, it may report a large number of false positives due to the unrecognized dataflow propagation of the program. Second, it recommends a wide range of thread context switches to schedule a reported race (including false ones) whenever this race is exposed during the constraint-solving process. This ad hoc recommendation imposes too many context switches, which complicates data race analysis. To address these two limitations in state-of-the-art constraint-based race detection, this paper proposes DFTracker, an improved constraint-based race detector that recommends each data race with minimal thread context switches. Specifically, we reduce the false positives by analyzing and tracking the dataflow in the program; DFTracker thus avoids the unnecessary analysis of false race schedules. We further propose a novel algorithm to recommend an effective race schedule with minimal thread context switches for each data race. Our experimental results on real applications demonstrate that 1) without removing any true data race, DFTracker effectively prunes false positives by 68% in comparison with the state-of-the-art constraint-based race detector; and 2) DFTracker recommends as few as 2.6–8.3 (4.7 on average) thread context switches per data race in the real world, which is 81.6% fewer context switches per data race than the state-of-the-art constraint-based race detector. Therefore, DFTracker can be used as an effective tool for programmers to understand data races.
{"title":"Minimal Context-Switching Data Race Detection with Dataflow Tracking","authors":"Long Zheng, Yang Li, Jie Xin, Hai-Feng Liu, Ran Zheng, Xiao-Fei Liao, Hai Jin","doi":"10.1007/s11390-023-1569-7","DOIUrl":"https://doi.org/10.1007/s11390-023-1569-7","url":null,"abstract":"<p>Data race is one of the most important concurrent anomalies in multi-threaded programs. Emerging constraint- based techniques are leveraged into race detection, which is able to find all the races that can be found by any other sound race detector. However, this constraint-based approach has serious limitations on helping programmers analyze and understand data races. First, it may report a large number of false positives due to the unrecognized dataflow propagation of the program. Second, it recommends a wide range of thread context switches to schedule the reported race (including the false one) whenever this race is exposed during the constraint-solving process. This ad hoc recommendation imposes too many context switches, which complicates the data race analysis. To address these two limitations in the state-of-the-art constraint-based race detection, this paper proposes DFTracker, an improved constraint-based race detector to recommend each data race with minimal thread context switches. Specifically, we reduce the false positives by analyzing and tracking the dataflow in the program. By this means, DFTracker thus reduces the unnecessary analysis of false race schedules. We further propose a novel algorithm to recommend an effective race schedule with minimal thread context switches for each data race. Our experimental results on the real applications demonstrate that 1) without removing any true data race, DFTracker effectively prunes false positives by 68% in comparison with the state-of-the-art constraint-based race detector; 2) DFTracker recommends as low as 2.6–8.3 (4.7 on average) thread context switches per data race in the real world, which is 81.6% fewer context switches per data race than the state-of-the-art constraint based race detector. Therefore, DFTracker can be used as an effective tool to understand the data race for programmers.</p>","PeriodicalId":50222,"journal":{"name":"Journal of Computer Science and Technology","volume":"10 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140581825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most distributed stream processing engines (DSPEs) do not support online task management and cannot adapt to time-varying data flows. Recently, some studies have proposed online task deployment algorithms to solve this problem. However, these approaches do not guarantee the Quality of Service (QoS) when the task deployment changes at runtime, because the task migrations caused by the change of task deployments impose an exorbitant cost. We study one of the most popular DSPEs, Apache Storm, and find that when a task needs to be migrated, Storm has to stop the resource (implemented as a Worker process in Storm) where the task is deployed. This leads to the stop and restart of all tasks in the resource, resulting in poor task migration performance. To solve this problem, in this paper we propose N-Storm (Nonstop Storm), a task-resource decoupling DSPE. N-Storm allows the tasks allocated to resources to be changed at runtime, which is implemented by a thread-level scheme for task migrations. In particular, we add a local shared key/value store on each node to make resources aware of changes in the allocation plan, so that each resource can manage its tasks at runtime. Based on N-Storm, we further propose Online Task Deployment (OTD). Differing from traditional task deployment algorithms that deploy all tasks at once without considering the cost of task migrations caused by re-deployment, OTD can gradually adjust the current task deployment to an optimized one based on the communication cost and the runtime states of resources. We demonstrate that OTD can adapt to different kinds of applications, including computation- and communication-intensive applications. The experimental results on a real DSPE cluster show that N-Storm can avoid system stops and save up to 87% of the performance degradation time, compared with Apache Storm and other state-of-the-art approaches. In addition, OTD can increase the average CPU usage by 51% for computation-intensive applications and reduce network communication costs by 88% for communication-intensive applications.
{"title":"Online Nonstop Task Management for Storm-Based Distributed Stream Processing Engines","authors":"Zhou Zhang, Pei-Quan Jin, Xi-Ke Xie, Xiao-Liang Wang, Rui-Cheng Liu, Shou-Hong Wan","doi":"10.1007/s11390-021-1629-9","DOIUrl":"https://doi.org/10.1007/s11390-021-1629-9","url":null,"abstract":"<p>Most distributed stream processing engines (DSPEs) do not support online task management and cannot adapt to time-varying data flows. Recently, some studies have proposed online task deployment algorithms to solve this problem. However, these approaches do not guarantee the Quality of Service (QoS) when the task deployment changes at runtime, because the task migrations caused by the change of task deployments will impose an exorbitant cost. We study one of the most popular DSPEs, Apache Storm, and find out that when a task needs to be migrated, Storm has to stop the resource (implemented as a process of Worker in Storm) where the task is deployed. This will lead to the stop and restart of all tasks in the resource, resulting in the poor performance of task migrations. Aiming to solve this problem, in this paper, we propose N-Storm (Nonstop Storm), which is a task-resource decoupling DSPE. N-Storm allows tasks allocated to resources to be changed at runtime, which is implemented by a thread-level scheme for task migrations. Particularly, we add a local shared key/value store on each node to make resources aware of the changes in the allocation plan. Thus, each resource can manage its tasks at runtime. Based on N-Storm, we further propose Online Task Deployment (OTD). Differing from traditional task deployment algorithms that deploy all tasks at once without considering the cost of task migrations caused by a task re-deployment, OTD can gradually adjust the current task deployment to an optimized one based on the communication cost and the runtime states of resources. We demonstrate that OTD can adapt to different kinds of applications including computation- and communication-intensive applications. The experimental results on a real DSPE cluster show that N-Storm can avoid the system stop and save up to 87% of the performance degradation time, compared with Apache Storm and other state-of-the-art approaches. In addition, OTD can increase the average CPU usage by 51% for computation-intensive applications and reduce network communication costs by 88% for communication-intensive applications.</p>","PeriodicalId":50222,"journal":{"name":"Journal of Computer Science and Technology","volume":"35 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140581909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The emergence of software-defined vehicles (SDVs), combined with autonomous driving technologies, has enabled a new era of vehicle computing (VC), where vehicles serve as a mobile computing platform. However, the interdisciplinary complexities of automotive systems and diverse technological requirements make developing applications for autonomous vehicles challenging. To simplify the development of applications running on SDVs, we propose a comprehensive suite of vehicle programming interfaces (VPIs). In this study, we rigorously explore the nuanced requirements for application development within the realm of VC, centering our analysis on the architectural intricacies of the Open Vehicular Data Analytics Platform (OpenVDAP). We then detail our creation of a comprehensive suite of standardized VPIs, spanning five critical categories: Hardware, Data, Computation, Service, and Management, to address these evolving programming requirements. To validate the design of the VPIs, we conduct experiments using the indoor autonomous vehicle Zebra and develop the OpenVDAP prototype system. Compared with the industry-influential AUTOSAR interface, our VPIs demonstrate significant enhancements in programming efficiency, marking an important advancement in the field of SDV application development. We also present a case study and evaluate its performance. Our work highlights that VPIs significantly enhance the efficiency of developing applications for VC. They meet both current and future technological demands and propel the software-defined automotive industry toward a more interconnected and intelligent future.
{"title":"VPI: Vehicle Programming Interface for Vehicle Computing","authors":"Bao-Fu Wu, Ren Zhong, Yuxin Wang, Jian Wan, Ji-Lin Zhang, Weisong Shi","doi":"10.1007/s11390-024-4035-2","DOIUrl":"https://doi.org/10.1007/s11390-024-4035-2","url":null,"abstract":"<p>The emergence of software-defined vehicles (SDVs), combined with autonomous driving technologies, has enabled a new era of vehicle computing (VC), where vehicles serve as a mobile computing platform. However, the interdisciplinary complexities of automotive systems and diverse technological requirements make developing applications for autonomous vehicles challenging. To simplify the development of applications running on SDVs, we propose a comprehensive suite of vehicle programming interfaces (VPIs). In this study, we rigorously explore the nuanced requirements for application development within the realm of VC, centering our analysis on the architectural intricacies of the Open Vehicular Data Analytics Platform (OpenVDAP). We then detail our creation of a comprehensive suite of standardized VPIs, spanning five critical categories: Hardware, Data, Computation, Service, and Management, to address these evolving programming requirements. To validate the design of VPIs, we conduct experiments using the indoor autonomous vehicle, Zebra, and develop the OpenVDAP prototype system. By comparing it with the industry-influential AUTOSAR interface, our VPIs demonstrate significant enhancements in programming efficiency, marking an important advancement in the field of SDV application development. We also show a case study and evaluate its performance. Our work highlights that VPIs significantly enhance the efficiency of developing applications on VC. They meet both current and future technological demands and propel the software-defined automotive industry toward a more interconnected and intelligent future.</p>","PeriodicalId":50222,"journal":{"name":"Journal of Computer Science and Technology","volume":"13 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140581911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
10-Million Atoms Simulation of First-Principle Package LS3DF
Pub Date: 2024-01-30, DOI: 10.1007/s11390-023-3011-6
Yu-Jin Yan, Hai-Bo Li, Tong Zhao, Lin-Wang Wang, Lin Shi, Tao Liu, Guang-Ming Tan, Wei-Le Jia, Ning-Hui Sun
The growing demand for semiconductor device simulation poses a great challenge for large-scale electronic structure calculations. Among various methods, the linearly scaling three-dimensional fragment (LS3DF) method exhibits excellent scalability in large-scale simulations. Based on algorithmic and system-level optimizations, we propose a highly scalable and highly efficient implementation of LS3DF on a domestic heterogeneous supercomputer equipped with accelerators. In terms of algorithmic optimizations, the original all-band conjugate gradient algorithm is refined to achieve faster convergence, and mixed-precision computing is adopted to increase overall efficiency. In terms of system-level optimizations, the original two-layer parallel structure is replaced by a coarse-grained parallel method. Optimization strategies such as multi-stream execution, kernel fusion, and redundant computation removal are proposed to further increase the utilization of the computational power provided by the heterogeneous machines. As a result, our optimized LS3DF can scale to a 10-million-silicon-atom system, attaining a peak performance of 34.8 PFLOPS (21.2% of the theoretical peak). All the improvements can be adapted to next-generation supercomputers for larger simulations.
{"title":"10-Million Atoms Simulation of First-Principle Package LS3DF","authors":"Yu-Jin Yan, Hai-Bo Li, Tong Zhao, Lin-Wang Wang, Lin Shi, Tao Liu, Guang-Ming Tan, Wei-Le Jia, Ning-Hui Sun","doi":"10.1007/s11390-023-3011-6","DOIUrl":"https://doi.org/10.1007/s11390-023-3011-6","url":null,"abstract":"<p>The growing demand for semiconductor devices simulation poses a big challenge for large-scale electronic structure calculations. Among various methods, the linearly scaling three-dimensional fragment (LS3DF) method exhibits excellent scalability in large-scale simulations. Based on algorithmic and system-level optimizations, we propose a highly scalable and highly efficient implementation of LS3DF on a domestic heterogeneous supercomputer equipped with accelerators. In terms of algorithmic optimizations, the original all-band conjugate gradient algorithm is refined to achieve faster convergence, and mixed precision computing is adopted to increase overall efficiency. In terms of system-level optimizations, the original two-layer parallel structure is replaced by a coarse-grained parallel method. Optimization strategies such as multi-stream, kernel fusion, and redundant computation removal are proposed to increase further utilization of the computational power provided by the heterogeneous machines. As a result, our optimized LS3DF can scale to a 10-million silicon atoms system, attaining a peak performance of 34.8 PFLOPS (21.2% of the peak). All the improvements can be adapted to the next-generation supercomputers for larger simulations.</p>","PeriodicalId":50222,"journal":{"name":"Journal of Computer Science and Technology","volume":"65 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140581920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SMEC: Scene Mining for E-Commerce
Pub Date: 2024-01-30, DOI: 10.1007/s11390-021-1277-0
Gang Wang, Xiang Li, Zi-Yi Guo, Da-Wei Yin, Shuai Ma
Scene-based recommendation has proven its usefulness in E-commerce by recommending commodities based on a given scene. However, scenes are typically unknown in advance, which necessitates scene discovery for E-commerce. In this article, we study scene discovery for E-commerce systems. We first formalize a scene as a set of commodity categories that occur simultaneously and frequently in real-world situations, and model an E-commerce platform as a heterogeneous information network (HIN), whose nodes and links represent different types of objects and different types of relationships between objects, respectively. We then formulate the scene mining problem for E-commerce as an unsupervised learning problem that finds the overlapping clusters of commodity categories in the HIN. To solve the problem, we propose a non-negative matrix factorization based method, SMEC (Scene Mining for E-Commerce), and theoretically prove its convergence. Using six real-world E-commerce datasets, we finally conduct an extensive experimental study to evaluate SMEC against 13 other methods and show that SMEC consistently outperforms its competitors with regard to various evaluation measures.
{"title":"SMEC: Scene Mining for E-Commerce","authors":"Gang Wang, Xiang Li, Zi-Yi Guo, Da-Wei Yin, Shuai Ma","doi":"10.1007/s11390-021-1277-0","DOIUrl":"https://doi.org/10.1007/s11390-021-1277-0","url":null,"abstract":"<p>Scene-based recommendation has proven its usefulness in E-commerce, by recommending commodities based on a given scene. However, scenes are typically unknown in advance, which necessitates scene discovery for E-commerce. In this article, we study scene discovery for E-commerce systems. We first formalize a scene as a set of commodity categories that occur simultaneously and frequently in real-world situations, and model an E-commerce platform as a heterogeneous information network (HIN), whose nodes and links represent different types of objects and different types of relationships between objects, respectively. We then formulate the scene mining problem for E-commerce as an unsupervised learning problem that finds the overlapping clusters of commodity categories in the HIN. To solve the problem, we propose a non-negative matrix factorization based method SMEC (Scene Mining for E-Commerce), and theoretically prove its convergence. Using six real-world E-commerce datasets, we finally conduct an extensive experimental study to evaluate SMEC against 13 other methods, and show that SMEC consistently outperforms its competitors with regard to various evaluation measures.</p>","PeriodicalId":50222,"journal":{"name":"Journal of Computer Science and Technology","volume":"18 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140602367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DIR: Dynamic Request Interleaving for Improving the Read Performance of Aged Solid-State Drives
Pub Date: 2024-01-30, DOI: 10.1007/s11390-023-1601-y
Shi-Qiang Nie, Chi Zhang, Wei-Guo Wu
Triple-level cell (TLC) NAND flash is increasingly adopted to build solid-state drives (SSDs) for modern computer systems. While TLC NAND flash effectively improves storage density, it faces severe reliability issues; in particular, different pages exhibit different raw bit error rates (RBERs). Integrating strong low-density parity-check (LDPC) codes helps to improve reliability, but suffers from prolonged read latency that grows with the multiple read retries required for worse pages. A straightforward idea is that dispersing page-size data across several pages of different types can achieve a lower average RBER and reduce the read latency. However, directly implementing this simple idea in the flash translation layer (FTL) induces read amplification, as one logical page residing in more than one physical page requires several read operations. In this paper, we propose the Dynamic Request Interleaving (DIR) technique for improving the performance of TLC NAND flash-based SSDs, in particular aged ones with large RBERs. DIR exploits the observation that the latency of an I/O request is determined, without considering the queuing time, by the access to the slowest device page, i.e., the page that has the highest RBER. By grouping consecutive logical pages that have high locality and interleaving their encoded data across different types of device pages with different RBERs, DIR effectively reduces the number of read retries for LDPC with limited read amplification. To meet the requirement of allocating hybrid page types for interleaved data, we also design a page-interleaving-friendly page allocation scheme, which splits all the planes into multi-plane regions for storing the interleaved data and single-plane regions for storing normal data. The pages in a multi-plane region can be read/written in parallel by the proposed multi-plane command, avoiding the read amplification issue. Based on the DIR scheme and the proposed page allocation scheme, we build a DIR-enabled FTL, which integrates the proposed schemes into the FTL with some modifications. Our experimental results show that adopting DIR in aged SSDs exploits nearly 33% locality from I/O requests and, on average, reduces read latency by 43% over conventional aged SSDs.
{"title":"DIR: Dynamic Request Interleaving for Improving the Read Performance of Aged Solid-State Drives","authors":"Shi-Qiang Nie, Chi Zhang, Wei-Guo Wu","doi":"10.1007/s11390-023-1601-y","DOIUrl":"https://doi.org/10.1007/s11390-023-1601-y","url":null,"abstract":"<p>Triple-level cell (TLC) NAND flash is increasingly adopted to build solid-state drives (SSDs) for modern computer systems. While TLC NAND flash effectively improves storage density, it faces severe reliability issues; in particular, the pages exhibit different raw bit error rates (RBERs). Integrating strong low-density parity-check (LDPC) code helps to improve reliability but suffers from prolonged and proportional read latency due to multiple read retries for worse pages. The straightforward idea is that dispersing page-size data across several pages in different types can achieve a lower average RBER and reduce the read latency. However, directly implementing this simple idea into flash translation layer (FTL) induces the read amplification issue as one logic page residing in more than one physical page brings several read operations. In this paper, we propose the Dynamic Request Interleaving (DIR) technology for improving the performance of TLC NAND flash-based SSDs, in particular, the aged ones with large RBERs. DIR exploits the observation that the latency of an I/O request is determined, without considering the queuing time, by the access of the slowest device page, i.e., the page that has the highest RBER. By grouping consecutive logical pages that have high locality and interleaving their encoded data in different types of device pages that have different RBERs, DIR effectively reduces the number of read retries for LDPC with limited read amplification. To meet the requirement of allocating hybrid page types for interleaved data, we also design a page-interleaving friendly page allocation scheme, which splits all the planes into multi-plane regions for storing the interleaved data and single-plane regions for storing the normal data. The pages in the multi-plane region can be read/written in parallel by the proposed multi-plane command and avoid the read amplification issue. Based on the DIR scheme and the proposed page allocation scheme, we build DIR-enable FTL, which integrates the proposed schemes into the FTL with some modifications. Our experimental results show that adopting DIR in aged SSDs exploits nearly 33% locality from I/O requests and, on average, reduces 43% read latency over conventional aged SSDs.</p>","PeriodicalId":50222,"journal":{"name":"Journal of Computer Science and Technology","volume":"36 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140581824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Brain-inspired computing is a new technology that draws on the principles of brain science and is oriented toward the efficient development of artificial general intelligence (AGI), and a brain-inspired computing system is a hierarchical system composed of neuromorphic chips, basic software and hardware, and algorithms/applications that embody this technology. While such systems are developing rapidly, they face various challenges and opportunities brought by interdisciplinary research, including the issue of software and hardware fragmentation. This paper analyzes the status quo of brain-inspired computing systems. Enlightened by the design principles and methodology of general-purpose computers, we propose to construct "general-purpose" brain-inspired computing systems. A general-purpose brain-inspired computing system refers to a brain-inspired computing hierarchy constructed on the design philosophy of decoupling software and hardware, which can flexibly support various brain-inspired computing applications and neuromorphic chips with different architectures. Further, this paper introduces our recent work in these aspects, including ANN (artificial neural network)/SNN (spiking neural network) development tools, a hardware-agnostic compilation infrastructure, and a chip micro-architecture with high programming flexibility and high performance. These studies show that a "general-purpose" system can remarkably improve the efficiency of application development and enhance the productivity of basic software, thereby helping to accelerate the advancement of various brain-inspired algorithms and applications. We believe that this is the key to the collaborative research and development, and the evolution, of applications, basic software, and chips in this field, and that it is conducive to building a favorable software/hardware ecosystem for brain-inspired computing.
{"title":"Research on General-Purpose Brain-Inspired Computing Systems","authors":"Peng Qu, Xing-Long Ji, Jia-Jie Chen, Meng Pang, Yu-Chen Li, Xiao-Yi Liu, You-Hui Zhang","doi":"10.1007/s11390-023-4002-3","DOIUrl":"https://doi.org/10.1007/s11390-023-4002-3","url":null,"abstract":"<p>Brain-inspired computing is a new technology that draws on the principles of brain science and is oriented to the efficient development of artificial general intelligence (AGI), and a brain-inspired computing system is a hierarchical system composed of neuromorphic chips, basic software and hardware, and algorithms/applications that embody this technology. While the system is developing rapidly, it faces various challenges and opportunities brought by interdisciplinary research, including the issue of software and hardware fragmentation. This paper analyzes the status quo of brain-inspired computing systems. Enlightened by some design principle and methodology of general-purpose computers, it is proposed to construct “general-purpose” brain-inspired computing systems. A general-purpose brain-inspired computing system refers to a brain-inspired computing hierarchy constructed based on the design philosophy of decoupling software and hardware, which can flexibly support various brain-inspired computing applications and neuromorphic chips with different architectures. Further, this paper introduces our recent work in these aspects, including the ANN (artificial neural network)/SNN (spiking neural network) development tools, the hardware agnostic compilation infrastructure, and the chip micro-architecture with high flexibility of programming and high performance; these studies show that the “general-purpose” system can remarkably improve the efficiency of application development and enhance the productivity of basic software, thereby being conductive to accelerating the advancement of various brain-inspired algorithms and applications. We believe that this is the key to the collaborative research and development, and the evolution of applications, basic software and chips in this field, and conducive to building a favorable software/hardware ecosystem of brain-inspired computing.</p>","PeriodicalId":50222,"journal":{"name":"Journal of Computer Science and Technology","volume":"42 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140582163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}