A Multi-solution Study on GDPR AI-enabled Completeness Checking of DPAs

IF 3.6; Zone 2, Computer Science; Q1, Computer Science, Software Engineering; Empirical Software Engineering; Pub Date: 2024-06-14; DOI: 10.1007/s10664-024-10491-3

Muhammad Ilyas Azeem, Sallam Abualhaija

Citations: 0

Abstract

Specifying legal requirements for software systems to ensure their compliance with applicable regulations is a major concern of requirements engineering. Personal data collected by an organization is often shared with other organizations to perform certain processing activities. In such cases, the General Data Protection Regulation (GDPR) requires issuing a data processing agreement (DPA), which regulates the processing and further ensures that personal data remains protected. Violating GDPR can lead to huge fines reaching billions of euros. Software systems involving personal data processing must adhere both to the legal obligations stipulated at a general level in GDPR and to the business-specific obligations outlined in DPAs. In other words, a DPA is yet another source from which requirements engineers can elicit legal requirements. However, the DPA must be complete according to GDPR to ensure that the elicited requirements cover the complete set of obligations. Therefore, checking the completeness of DPAs is a prerequisite step towards developing a compliant system. Analyzing DPAs with respect to GDPR entirely manually is time-consuming and requires adequate legal expertise. In this paper, we propose an automation strategy that frames the completeness checking of DPAs against GDPR provisions as a text classification problem. Specifically, we pursue ten alternative solutions enabled by different technologies, namely traditional machine learning, deep learning, language modeling, and few-shot learning. The goal of our work is to empirically examine how these different technologies fare in the legal domain. We computed the \(F_2\) score on a set of 30 real DPAs. Our evaluation shows that the best-performing solutions, which are based on the pre-trained BERT and RoBERTa language models, yield \(F_2\) scores of 86.7% and 89.7%, respectively. Our analysis further shows that alternative solutions based on deep learning (e.g., BiLSTM) and few-shot learning (e.g., SetFit) can achieve comparable accuracy, yet are more efficient to develop.
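
The \(F_2\) score reported in the abstract is the F-beta measure with beta = 2, which weights recall more heavily than precision. The following is a minimal sketch of the standard definition only; the function name `f_beta` and the input values are illustrative assumptions, and the abstract does not state how the authors aggregate precision and recall across provisions or documents.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R).

    With beta = 2 (the F2 score), recall counts more than precision.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both inputs are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical inputs for illustration only, not results from the paper.
high_recall = f_beta(precision=0.5, recall=1.0)
high_precision = f_beta(precision=1.0, recall=0.5)
print(high_recall > high_precision)  # the high-recall case scores higher under F2
```

Swapping precision and recall changes the result, which is the point of choosing beta = 2: a classifier that misses fewer relevant items scores higher than one that is merely more precise.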

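Source journal: Empirical Software Engineering (Engineering and Technology; Computer Science: Software Engineering)

CiteScore: 8.50

Self-citation rate: 12.20%

Articles published: 169

Review time: >12 weeks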

Journal description: Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories. The journal also offers industrial experience reports detailing the application of software technologies (processes, methods, or tools) and their effectiveness in industrial settings. Empirical Software Engineering promotes the publication of industry-relevant research, to address the significant gap between research and practice.