Pub Date: 2024-03-01 DOI: 10.1007/s10515-024-00421-4
Chia-Yi Su, Collin McMillan
A code summary is a brief natural language description of source code. Summaries are usually only a single sentence long, and yet form the backbone of developer documentation. A short description such as “changes all visible polygons to the color blue” can give a programmer a high-level idea of what code does without the effort of reading the code itself. Recently, products based on Large Language Models such as ChatGPT have demonstrated a strong ability to write these descriptions automatically. However, to use these tools, programmers must send their code to untrusted third parties for processing (e.g., via an API call). This loss of custody is not acceptable to many organizations. In this paper, we present an alternative: we train an open source model using sample output generated by GPT-3.5 in a process related to knowledge distillation. Our model is small enough (350M parameters) to be run on a single 16 GB GPU, yet we show in our evaluation that it is large enough to mimic GPT-3.5 on this task.
Title: Distilled GPT for source code summarization. Journal: Automated Software Engineering, vol. 31, no. 1.
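The abstract for "Distilled GPT for source code summarization" says only that the smaller model is trained on sample output generated by GPT-3.5. As a rough illustration of what preparing such training data could look like, the sketch below pairs each source function with a teacher-written summary. The field names `input` and `target`, the JSONL output, and both helper functions are assumptions made for this example, not the authors' pipeline.

```python
import json


def build_distillation_examples(functions, teacher_summaries):
    """Pair each source function with the teacher model's summary.

    Hypothetical helper: the record layout is an assumption for illustration.
    """
    if len(functions) != len(teacher_summaries):
        raise ValueError("each function needs exactly one teacher summary")
    examples = []
    for code, summary in zip(functions, teacher_summaries):
        # Collapse runs of whitespace so every target is a single clean line.
        summary = " ".join(summary.split())
        if not summary:
            # Skip functions for which the teacher returned no text.
            continue
        examples.append({"input": code, "target": summary})
    return examples


def to_jsonl(examples):
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)
```

A student model would then be fine-tuned on these pairs in place of human-written summaries, which is what keeps the code itself inside the organization once training data has been collected.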
Pub Date: 2024-03-01 DOI: 10.1007/s10515-024-00414-3
Erik M. Fredericks, Jared M. Moore, Abigail C. Diller
Generative art is a domain in which artistic output is created via a procedure or heuristic that may yield digital and/or physical results. A generative artist will typically act as a domain expert by specifying the algorithms that will form the basis of the piece as well as defining and refining parameters that can impact the results; however, such efforts can require a significant amount of time to generate the final output. This article presents and extends GenerativeGI, an evolutionary computation-based technique for creating generative art by automatically searching through combinations of artistic techniques and their accompanying parameters to produce outputs desirable to the designer. Generative art techniques and their respective parameters are encoded within a grammar that is then the target for genetic improvement. This grammar-based approach, combined with a many-objective evolutionary algorithm, enables the designer to efficiently search through a massive number of possible outputs that reflect their aesthetic preferences. We included a total of 15 generative art techniques and performed three separate empirical evaluations, each of which targets different aesthetic preferences and varying aspects of the search heuristic. Experimental results suggest that GenerativeGI can produce outputs that are significantly more novel than those generated by random or single-objective search. Furthermore, GenerativeGI produces individuals with a larger number of relevant techniques used to generate their overall composition.
Title: GenerativeGI: creating generative art with genetic improvement. Journal: Automated Software Engineering, vol. 31, no. 1.
Pub Date: 2024-03-01 DOI: 10.1007/s10515-024-00418-z
Yang Liu, Chao Wang, Yan Ma
Smart contracts are a new paradigm for decentralized software systems and play a key role in blockchain-based applications. Vulnerabilities in smart contracts are unacceptable, and some have caused significant economic losses. Machine learning, especially deep learning, is a promising approach to vulnerability detection for smart contracts. At present, deep learning-based vulnerability detection methods suffer from low accuracy, are time-consuming, and have a narrow range of application. To address these issues, we propose DL4SC, a novel deep learning-based vulnerability detection framework for smart contracts that operates at the opcode level. It is the first to orthogonally combine the Transformer encoder and a CNN (convolutional neural network) to detect vulnerabilities in smart contracts, and the first to exploit SSA (the sparrow search algorithm) to automatically search model hyperparameters for vulnerability detection. We implement DL4SC in Python on the PyTorch deep learning platform and compare it with existing work on three public datasets and one dataset we collected. The experimental results show that DL4SC detects vulnerabilities in smart contracts accurately and outperforms state-of-the-art approaches. The accuracy and F1-score of DL4SC are 95.29% and 95.68%, respectively.
Title: DL4SC: a novel deep learning-based vulnerability detection framework for smart contracts. Journal: Automated Software Engineering, vol. 31, no. 1.
Pub Date: 2024-02-29 DOI: 10.1007/s10515-024-00422-3
Maha Ayub, Muhammad Waiz Khan, Muhammad Umar Janjua
With the addition of multiple blockchain platforms in the ecosystem, DApp owners need to migrate their smart contracts from one platform to another to remain competitive, cost-effective, and secure. A smart contract is a piece of code that contains logic and data. To migrate a smart contract, whether on the same blockchain platform or a different one, we need both its source code, which represents the logic, and its data, which indicates the state of the contract. The source code can be easily set up, but to complete the migration, we have to extract the current state of the contract. In this paper, we develop an advanced state extraction technique that uses static analysis to analyze the smart contract’s call graph and events, and extracts the entire storage state from the storage trie, along with the proper associations across function calls, enabling users to visualize, manage, and transform the state as desired for migration. The soundness of the extracted state was confirmed using abstract interpretation. Further, the migration adapter is designed to transform the extracted state into slot-value pairs and migrate it to the target blockchain. Using our new approach, we were able to completely analyze 14% more smart contracts and extract 53% more data by analyzing function calls and event logs from 67,993 contracts, and we also migrated some contracts to multiple popular EVM-compatible blockchains.
Title: Sound analysis and migration of data from Ethereum smart contracts. Journal: Automated Software Engineering, vol. 31, no. 1.
Pub Date: 2024-02-28 DOI: 10.1007/s10515-024-00419-y
Kevin Lano, Hanan Siala
The porting or translation of software applications from one programming language to another is a common requirement of organisations that utilise software, and the increasing number and diversity of programming languages make this capability as relevant today as in previous decades. Several approaches have been used to address this challenge, including machine learning and the manual definition of direct language-to-language translation rules; however, the accuracy of these approaches remains unsatisfactory. In this paper we describe a new approach to program translation using model-driven engineering techniques: reverse-engineering source programs into specifications in the UML and OCL formalisms, and then forward-engineering the specifications to the required target language. This approach can provide assurance of semantic preservation, and additionally has the advantage of extracting precise specifications of software from code. We provide an evaluation based on a comprehensive dataset of examples, including industrial cases, and compare our results to those of other approaches and tools. Our specific contributions are: (1) Reverse-engineering source programs to detailed semantic models of software behaviour, to enable semantically-correct translations and reduce re-testing costs; (2) Program abstraction processes defined by precise and explicit rules, which can be edited and configured by users; (3) A set of reusable OCL library components that are appropriate for representing program semantics and can also be used for OCL specification of new applications; (4) A systematic procedure for building program abstractors based on language grammars and semantics.
Title: Using model-driven engineering to automate software language translation. Journal: Automated Software Engineering, vol. 31, no. 1. Open access PDF: https://link.springer.com/content/pdf/10.1007/s10515-024-00419-y.pdf
Pub Date: 2024-02-27 DOI: 10.1007/s10515-024-00424-1
Zhiqiang Li, Jingwen Niu, Xiao-Yuan Jing
Software defect prediction is one of the most popular research topics in software engineering. The objective of defect prediction is to identify defective instances before software defects occur, which helps prioritize software quality assurance efforts more effectively. In this article, we delve into various prospective research directions and potential challenges in the field of defect prediction. The aim of this article is to propose a range of defect prediction techniques and methodologies for the future. These ideas are intended to enhance the practicality, explainability, and actionability of the predictions of defect models.
Title: Software defect prediction: future directions and challenges. Journal: Automated Software Engineering, vol. 31, no. 1.
Pub Date: 2024-02-23 DOI: 10.1007/s10515-024-00416-1
Debasish Chakroborti, Kevin A. Schneider, Chanchal K. Roy
Pull-based development is widely used in popular social coding environments like GitHub and GitLab for both internal and external contributions. When critical bug fixes or features are committed to the main branch of a project, it is often desirable to also port those changes to other stable branches. This process is referred to as backporting, and pull-requests in the process are known as backports. Backports are typically determined after extensive discussion with collaborators, and it may take many days to identify backports, which commonly results in tags and references to the original pull-requests (i.e., pull-requests for the main branch) being missed. To help software development teams better identify and manage backports, we propose ReBack (Recommending Backports), a tool based on a deep-learning model for automatically identifying backports from pull-requests and related reviews, discussions, metadata, and committed code. ReBack predicted backports with 90.98% precision and 91.81% recall from 80,000 pull-requests in 17 GitHub projects. Although the results are promising, more research is required to further support backporting, including research into automatically porting a pull-request to further reduce costs when managing software versions and branches.
Title: ReBack: recommending backports in social coding environments. Journal: Automated Software Engineering, vol. 31, no. 1.
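The ReBack abstract reports precision and recall but no F1-score. The sketch below shows how the two reported figures combine under the standard harmonic-mean definition of F1. The resulting value is derived here for illustration and is not a number reported by the authors; the function name `f1_score` is simply a local helper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as stated in the abstract (90.98% and 91.81%).
f1 = f1_score(0.9098, 0.9181)
print(f"{f1:.4f}")  # 0.9139
```

Because the two inputs are close to each other, the harmonic mean lands almost midway between them, at roughly 91.4%.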
Pub Date: 2024-02-21 DOI: 10.1007/s10515-024-00417-0
Maryam Asgari Araghi, Vahid Rafe, Ferhat Khendek
Software testing plays a crucial role in enhancing software quality. A significant portion of the time and cost in software development is dedicated to testing. Automation, particularly in generating test cases, can greatly reduce the cost. Model-based testing aims at automatically generating test cases from models. Several model-based approaches use model checking tools to automate test case generation. However, this technique faces challenges such as state space explosion and duplication of test cases. This paper introduces a novel solution based on data mining algorithms for systems specified using graph transformation systems. To overcome the aforementioned challenges, the proposed method selectively explores only a portion of the state space based on test objectives. The proposed method is implemented using the GROOVE tool set for model-checking graph transformation systems specifications. Empirical results on widely used case studies in service-oriented architecture, as well as a comparison with related state-of-the-art techniques, demonstrate the efficiency and superiority of the proposed approach in terms of coverage and test suite size.
Title: Using data mining techniques to generate test cases from graph transformation systems specifications. Journal: Automated Software Engineering, vol. 31, no. 1.
Pub Date: 2024-02-18 DOI: 10.1007/s10515-024-00415-2
Davide Di Ruscio, Paola Inverardi, Patrizio Migliarini, Phuong T. Nguyen
Protecting the privacy and ethics of citizens is among the core concerns raised by an increasingly digital society. Profiling users is common practice for software applications, which triggers the need, also enforced by law, for users to manage privacy settings properly. Users need to manage these settings properly to protect personally identifiable information and express personal ethical preferences. This has proven to be very difficult for several concurrent reasons. However, profiling technologies can also empower users in their interaction with the digital world by reflecting personal ethical preferences and by automating or assisting the management of privacy settings. In this way, if it properly reflects users’ preferences, privacy profiling can become a key enabler for a trustworthy digital society. We focus on characterizing and collecting users’ privacy preferences and contribute a step in this direction through an empirical study on an existing dataset collected from the fitness domain. We aim to understand which set of questions is more appropriate to differentiate users according to their privacy preferences. The results reveal that a compact set of semantic-driven questions (about domain-independent privacy preferences) helps distinguish users better than a complex domain-dependent one. Based on the outcome, we implement a recommender system to provide users with suitable recommendations related to privacy choices. We then show that the proposed recommender system provides relevant settings to users, obtaining high accuracy.
Title: Leveraging privacy profiles to empower users in the digital society. Journal: Automated Software Engineering, vol. 31, no. 1. Open access PDF: https://link.springer.com/content/pdf/10.1007/s10515-024-00415-2.pdf
Pub Date : 2024-01-31 DOI: 10.1007/s10515-024-00413-4
Rongcun Wang, Senlei Xu, Xingyu Ji, Yuan Tian, Lina Gong, Ke Wang
Deep learning has achieved great progress in automated code vulnerability detection. Several code vulnerability detection approaches based on deep learning have been proposed. However, few studies have empirically examined the impact of different deep learning models on code vulnerability detection in Python. For this reason, we strive to cover many more code representation learning models and classification models for vulnerability detection. We design and conduct an empirical study to evaluate the effects on code vulnerability detection of eighteen deep learning architectures, derived from combining three representation learning models, i.e., Word2Vec, fastText, and CodeBERT, with six classification models, i.e., random forest, XGBoost, Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU). Additionally, two machine learning strategies, i.e., the attention and bi-directional mechanisms, are also empirically compared. Statistical significance and effect size analyses between the different models are also conducted. In terms of precision, recall, and F-score, Word2Vec is better than CodeBERT and fastText. Likewise, LSTM and GRU are superior to the other classification models we studied. The bi-directional LSTM and GRU with attention using Word2Vec are the two best-performing models for vulnerability detection in Python code. Moreover, they show medium or large effect sizes compared with LSTM and GRU using only a single mechanism. Both the representation learning models and the classification models have an important influence on vulnerability detection in Python code. Likewise, the bi-directional and attention mechanisms can affect the performance of code vulnerability detection.
{"title":"An extensive study of the effects of different deep learning models on code vulnerability detection in Python code","authors":"Rongcun Wang, Senlei Xu, Xingyu Ji, Yuan Tian, Lina Gong, Ke Wang","doi":"10.1007/s10515-024-00413-4","DOIUrl":"10.1007/s10515-024-00413-4","url":null,"abstract":"<div><p>Deep learning has achieved great progress in automated code vulnerability detection. Several code vulnerability detection approaches based on deep learning have been proposed. However, few studies have empirically examined the impact of different deep learning models on code vulnerability detection in Python. For this reason, we strive to cover many more code representation learning models and classification models for vulnerability detection. We design and conduct an empirical study to evaluate the effects on code vulnerability detection of eighteen deep learning architectures, derived from combining three representation learning models, i.e., Word2Vec, fastText, and CodeBERT, with six classification models, i.e., random forest, XGBoost, Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU). Additionally, two machine learning strategies, i.e., the attention and bi-directional mechanisms, are also empirically compared. Statistical significance and effect size analyses between the different models are also conducted. In terms of <i>precision</i>, <i>recall</i>, and <i>F</i>-<i>score</i>, Word2Vec is better than CodeBERT and fastText. Likewise, LSTM and GRU are superior to the other classification models we studied. The bi-directional LSTM and GRU with attention using Word2Vec are the two best-performing models for vulnerability detection in Python code. Moreover, they show medium or large effect sizes compared with LSTM and GRU using only a single mechanism. 
Both the representation learning models and classification models have important influences on vulnerability detection in Python code. Likewise, the bi-directional and attention mechanisms can impact the performance of code vulnerability detection.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"31 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139657735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
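The study design summarized in the abstract above pairs three representation learning models with six classification models. As a minimal sketch of how that grid yields eighteen architectures, the Python snippet below enumerates the pairings. The model names come from the abstract; the constant and function names (REPRESENTATIONS, CLASSIFIERS, build_grid) are hypothetical and are not taken from the authors' code.

```python
from itertools import product

# Model names as listed in the abstract; the list and function names
# below are illustrative assumptions, not the authors' implementation.
REPRESENTATIONS = ["Word2Vec", "fastText", "CodeBERT"]
CLASSIFIERS = ["random forest", "XGBoost", "MLP", "CNN", "LSTM", "GRU"]


def build_grid():
    """Return every (representation, classifier) pairing as a tuple."""
    return list(product(REPRESENTATIONS, CLASSIFIERS))


grid = build_grid()
for representation, classifier in grid:
    print(f"{representation} + {classifier}")
print(len(grid))  # 3 representations x 6 classifiers = 18
```

The attention and bi-directional mechanisms mentioned in the abstract are not modeled here; they would add further variants of the recurrent entries in the grid.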