
Latest Publications from the Journal of Data and Information Quality (JDIQ)

Editorial: Special Issue on Data Transparency—Data Quality, Annotation, and Provenance
Pub Date : 2022-02-03 DOI: 10.1145/3494454
M. Barhamgi, E. Bertino
Advances in Artificial Intelligence (AI) and mobile and Internet technologies have been progressively reshaping our lives over the past few years. The applications of the Internet of Things and cyber-physical systems today touch almost all aspects of our daily lives, including healthcare (e.g., remote patient monitoring environments), leisure (e.g., smart entertainment spaces), and work (e.g., smart manufacturing, asset management). For many of us, social media have become the rule rather than the exception as the way to interact, socialize, and exchange information. AI-powered systems have become a reality and started to affect our lives in important ways. These systems and services collect huge amounts of data about us and exploit it for various purposes that could positively or negatively affect our lives. Even though most of these systems claim to abide by data protection regulations and ethics, data misuse incidents keep making the headlines. In this new digital world, data transparency for end users is becoming a fundamental aspect to consider when designing, implementing, and deploying a system, service, or software [1, 3, 4]. Transparency allows users to track down and follow how their data are collected, transmitted, stored, processed, exploited, and serviced. It also allows them to verify how fairly they are treated by algorithms, software, and systems that affect their lives. Data transparency is a complex concept that is interpreted and approached in different ways by different research communities and bodies. A comprehensive definition of data transparency is proposed by Bertino et al. as “the ability of subjects to effectively gain access to all information related to data used in processes and decisions that affect the subjects” [2].
Citations: 2
Challenge Paper: The Vision for Time Profiled Temporal Association Mining
Pub Date : 2021-05-13 DOI: 10.1145/3404198
V. Radhakrishna, G. Reddy, Puligadda Veereswara Kumar, V. Janaki
Ecommerce has been the market disruptor in the modern world. Organizations have been focusing on mining enormous amounts of data to identify trends and extract crucial information from the voluminous data. Data are collected in databases that are transactional in nature. Millions of transactions are collected in a temporal context. Also, organizations are transitioning towards NoSQL databases. The transactions are distributed into timeslots. Such transaction data are called timestamped temporal data. Along with the focus on mining the temporal data and extracting patterns, trends, and information, the implicit focus is also on building an efficient algorithm that is accurate while reducing the time taken, the memory consumed, and the computational effort required to scan the database. Temporal associations discovered from timestamped temporal datasets [1, 2] are known as time profiled temporal patterns. From an application perspective, time profiled temporal
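The notion of a time profile in this abstract can be made concrete with a small sketch: transactions are bucketed into timeslots and, for every item, the per-slot support values form a sequence that can be compared against a reference prevalence pattern. This is illustrative only; the slot size, sample transactions, reference profile, and distance threshold below are assumptions, not the authors' algorithm.

```python
# A minimal sketch of binning timestamped transactions into timeslots and building
# per-item "time profiles" (per-slot support sequences). All data are invented.
from collections import defaultdict

# (timestamp, items) pairs; timestamps are hours purely for illustration
transactions = [
    (1, {"a", "b"}), (2, {"a"}), (3, {"a", "c"}),
    (25, {"b"}), (26, {"a", "b"}), (49, {"a"}), (50, {"c"}),
]

SLOT_SIZE = 24  # one slot per day in this toy example

def time_profiles(txns, slot_size):
    """Return {item: [support per slot]}, where support is the fraction of the
    slot's transactions containing the item."""
    slots = defaultdict(list)
    for ts, items in txns:
        slots[ts // slot_size].append(items)
    slot_ids = sorted(slots)
    profiles = defaultdict(lambda: [0.0] * len(slot_ids))
    for idx, sid in enumerate(slot_ids):
        n = len(slots[sid])
        counts = defaultdict(int)
        for items in slots[sid]:
            for it in items:
                counts[it] += 1
        for it, c in counts.items():
            profiles[it][idx] = c / n
    return dict(profiles)

def euclidean(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

profiles = time_profiles(transactions, SLOT_SIZE)
reference = [0.6, 0.6, 0.6]  # a hypothetical query/prevalence profile
for item, prof in sorted(profiles.items()):
    verdict = "similar" if euclidean(prof, reference) <= 0.5 else "dissimilar"
    print(item, [round(v, 2) for v in prof], verdict)
```

A time profiled association mining algorithm would perform this kind of comparison at scale, over itemsets rather than single items, and would prune candidates whose profiles cannot match the query pattern.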
Citations: 5
Editorial: Special Issue on Quality Assessment and Management in Big Data—Part I
Pub Date : 2021-05-06 DOI: 10.1145/3449052
Shadi A. Aljawarneh, J. Lara
It is a pleasure for us to introduce this Special Issue on Quality Assessment and Management in Big Data, Part I—Journal of Data and Information Quality, ACM. We received 27 original submissions, from which 11 final papers have been selected for publication (after a rigorous peer review process) in this issue, which is divided into two parts. This editorial corresponds to Part I, in which we included papers related to machine learning and quality management in big data scenarios. In the era of big data [1], organizations are dealing with tremendous amounts of data, which are fast-moving and can originate from various sources, such as social networks [2], unstructured data from various websites [3], or raw feeds from sensors [4]. Big data solutions are used to optimize business processes and reduce decision-making times, so as to improve operational effectiveness. Big data practitioners are experiencing a huge number of data quality problems [5]. These can be time-consuming to solve or even lead to incorrect data analytics. How to manage quality in big data has become challenging, and thus far research has only addressed limited aspects. Given the complex nature of big data, traditional data quality management approaches cannot simply be applied to big data quality management.
Citations: 16
Developing a Global Data Breach Database and the Challenges Encountered
Pub Date : 2021-01-15 DOI: 10.1145/3439873
Nelson Novaes Neto, S. Madnick, A. Paula, Natasha Malara Borges
If the mantra “data is the new oil” of our digital economy is correct, then data leak incidents are critical disasters in the online society. The initial goal of our research was to present a comprehensive database of data breaches of personal information that took place in 2018 and 2019. This information was to be drawn from press reports, industry studies, and reports from regulatory agencies across the world. This article identified the top 430 largest data breach incidents among more than 10,000 data breach incidents. In the process, we encountered many complications, especially regarding the lack of standardization of reporting. This article should be especially interesting to the readers of JDIQ because it describes both the range of data quality and consistency issues found as well as what was learned from the database created. The database that was created, available at https://www.databreachdb.com, shows that the number of data records breached in those top 430 incidents increased from around 4B in 2018 to more than 22B in 2019. This increase occurred despite the strong efforts from regulatory agencies across the world to enforce strict rules on data protection and privacy, such as the General Data Protection Regulation (GDPR) that went into effect in Europe in May 2018. Such regulatory effort could explain the reason why there is such a large number of data breach cases reported in the European Union when compared to the U.S. (more than 10,000 data breaches publicly reported in the U.S. since 2018, while the EU reported more than 160,000 data breaches since May 2018). However, we still face the problem of an excessive number of breach incidents around the world. This research helps to understand the challenges of proper visibility of such incidents on a global scale. The results of this research can help government entities, regulatory bodies, security and data quality researchers, companies, and managers to improve the data quality of data breach reporting and increase the visibility of the data breach landscape around the world in the future.
Citations: 16
Knowledge Transfer for Entity Resolution with Siamese Neural Networks
Pub Date : 2021-01-13 DOI: 10.1145/3410157
M. Loster, Ioannis K. Koumarelas, Felix Naumann
The integration of multiple data sources is a common problem in a large variety of applications. Traditionally, handcrafted similarity measures are used to discover, merge, and integrate multiple representations of the same entity—duplicates—into a large homogeneous collection of data. Often, these similarity measures do not cope well with the heterogeneity of the underlying dataset. In addition, domain experts are needed to manually design and configure such measures, which is both time-consuming and requires extensive domain expertise. We propose a deep Siamese neural network, capable of learning a similarity measure that is tailored to the characteristics of a particular dataset. With the properties of deep learning methods, we are able to eliminate the manual feature engineering process and thus considerably reduce the effort required for model construction. In addition, we show that it is possible to transfer knowledge acquired during the deduplication of one dataset to another, and thus significantly reduce the amount of data required to train a similarity measure. We evaluated our method on multiple datasets and compared our approach to state-of-the-art deduplication methods. Our approach outperforms competitors by up to +26 percent F-measure, depending on task and dataset. In addition, we show that knowledge transfer is not only feasible, but in our experiments led to an improvement in F-measure of up to +4.7 percent.
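As a rough illustration of the idea, not the paper's architecture, the sketch below builds a tiny Siamese network in PyTorch: one shared encoder embeds both records of a pair, and the cosine similarity between the embeddings is trained against duplicate/non-duplicate labels. The hashing-trick featurization, layer sizes, toy records, and training loop are all assumptions made for brevity.

```python
# A minimal Siamese-network sketch for learning a record similarity measure.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_BUCKETS, EMB_DIM = 2048, 64

def featurize(text, num_buckets=NUM_BUCKETS):
    """Hash character trigrams into a fixed-size bag-of-features vector."""
    vec = torch.zeros(num_buckets)
    t = f"  {text.lower()}  "
    for i in range(len(t) - 2):
        vec[hash(t[i:i + 3]) % num_buckets] += 1.0
    return vec

class SiameseEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_BUCKETS, 256), nn.ReLU(),
            nn.Linear(256, EMB_DIM),
        )

    def forward(self, a, b):
        # The same weights encode both sides, giving a learned, symmetric similarity.
        ea, eb = self.net(a), self.net(b)
        return F.cosine_similarity(ea, eb, dim=-1)

pairs = [  # (record A, record B, 1 = duplicate, 0 = distinct) -- toy examples
    ("J. Smith, 12 Main St", "John Smith, 12 Main Street", 1.0),
    ("J. Smith, 12 Main St", "Jane Doe, 99 Oak Ave", 0.0),
]
xa = torch.stack([featurize(a) for a, _, _ in pairs])
xb = torch.stack([featurize(b) for _, b, _ in pairs])
y = torch.tensor([label for _, _, label in pairs])

model = SiameseEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):  # tiny training loop, just to show the mechanics
    opt.zero_grad()
    sim = model(xa, xb)                          # in [-1, 1]
    loss = F.binary_cross_entropy((sim + 1) / 2, y)
    loss.backward()
    opt.step()
print(model(xa, xb).detach())  # higher score expected for the duplicate pair
```

In this sketch, knowledge transfer would amount to reusing the trained encoder weights when fine-tuning on a new dataset, so far fewer labeled pairs are needed there.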
Citations: 15
Experience
Pub Date : 2021-01-13 DOI: 10.1145/3439307
Michela Fazzolari, F. Buccafurri, G. Lax, M. Petrocchi
Over the past few years, online reviews have become very important, since they can influence the purchase decision of consumers and the reputation of businesses. Therefore, the practice of writing fake reviews can have severe consequences on customers and service providers. Various approaches have been proposed for detecting opinion spam in online reviews, especially based on supervised classifiers. In this contribution, we start from a set of effective features used for classifying opinion spam and re-engineer them by considering the Cumulative Relative Frequency Distribution of each feature. By an experimental evaluation carried out on real data from Yelp.com, we show that the use of the distributional features improves the performance of classifiers.
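The feature re-engineering step can be pictured with a short sketch: each raw feature value is replaced by its cumulative relative frequency, i.e., the fraction of training values less than or equal to it. The feature name and values below are invented for illustration; they are not taken from the paper's Yelp data.

```python
# A minimal sketch of an empirical-CDF (cumulative relative frequency) feature transform.
from bisect import bisect_right

def cdf_transform(train_values):
    """Return a function mapping a value to the fraction of training values <= it."""
    sorted_vals = sorted(train_values)
    n = len(sorted_vals)
    return lambda v: bisect_right(sorted_vals, v) / n

review_lengths = [12, 40, 40, 55, 90, 300, 41, 18]   # a hypothetical per-review feature
to_cdf = cdf_transform(review_lengths)
print([round(to_cdf(v), 3) for v in review_lengths])
# The transformed feature lies in [0, 1] and is comparable across differently scaled
# features, which is what makes it convenient as classifier input.
```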
Citations: 1
Deep Entity Matching
Pub Date : 2021-01-06 DOI: 10.1145/3431816
Yuliang Li, Jinfeng Li, Yoshihiko Suhara, Jin Wang, Wataru Hirota, W. Tan
Entity matching refers to the task of determining whether two different representations refer to the same real-world entity. It continues to be a prevalent problem for many organizations where data resides in different sources and duplicates need to be identified and managed. The term “entity matching” also loosely refers to the broader problem of determining whether two heterogeneous representations of different entities should be associated together. This problem has an even wider scope of applications, from determining the subsidiaries of companies to matching jobs to job seekers, which has impactful consequences. In this article, we first report our recent system DITTO, which is an example of a modern entity matching system based on pretrained language models. Then we summarize recent solutions in applying deep learning and pre-trained language models for solving the entity matching task. Finally, we discuss research directions beyond entity matching, including the promise of synergistically integrating blocking and entity matching steps together, the need to examine methods to alleviate steep training data requirements that are typical of deep learning or pre-trained language models, and the importance of generalizing entity matching solutions to handle the broader entity matching problem, which leads to an even more pressing need to explain matching outcomes.
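A minimal sketch of this style of entity matching follows: each record is serialized into a tagged string and a pretrained transformer scores the pair as a sequence-pair classification problem. The base model, the tag format, and the sample records are assumptions made for illustration, not DITTO's exact configuration, and the classification head would need fine-tuning on labeled match/non-match pairs before its score is meaningful.

```python
# A minimal sketch of entity matching as sequence-pair classification with a
# pretrained language model; model name, tags, and records are illustrative only.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def serialize(record):
    """Flatten an attribute/value record into one tagged string."""
    return " ".join(f"[COL] {col} [VAL] {val}" for col, val in record.items())

left = {"title": "iPhone 13 Pro 128GB", "brand": "Apple", "price": "999"}
right = {"name": "Apple iPhone 13 Pro (128 GB)", "price": "989.00"}

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # would be fine-tuned on labeled pairs

enc = tok(serialize(left), serialize(right), return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)
print(f"match probability (untrained head, illustrative only): {probs[0, 1].item():.3f}")
```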
Citations: 16
BLAST2
Pub Date : 2020-11-10 DOI: 10.1145/3394957
D. Beneventano, S. Bergamaschi, Luca Gagliardelli, Giovanni Simonini
We present BLAST2, a novel technique to efficiently extract loose schema information, i.e., metadata that can serve as a surrogate of the schema alignment task within the Entity Resolution (ER) process, to identify records that refer to the same real-world entity when integrating multiple, heterogeneous, and voluminous data sources. The loose schema information is exploited for reducing the overall complexity of ER, whose naïve solution would imply O(n²) comparisons, where n is the number of entity representations involved in the process and can be extracted by both structured and unstructured data sources. BLAST2 is completely unsupervised yet able to achieve almost the same precision and recall of supervised state-of-the-art schema alignment techniques when employed for Entity Resolution tasks, as shown in our experimental evaluation performed on two real-world datasets (composed of 7 and 10 data sources, respectively).
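To see why reducing the O(n²) comparison space matters, the sketch below shows the simplest form of blocking: records that share a token land in the same block, and only records within a block are compared. BLAST2's contribution, extracting loose schema information to guide this step, is not reproduced here; the records and the token-based blocking key are illustrative assumptions.

```python
# A minimal blocking sketch: candidate pairs come only from blocks of shared tokens.
from collections import defaultdict
from itertools import combinations

records = {
    1: "apple iphone 13 pro",
    2: "iphone 13 pro by apple",
    3: "samsung galaxy s21",
    4: "galaxy s21 samsung phone",
}

blocks = defaultdict(set)
for rid, text in records.items():
    for token in set(text.split()):
        blocks[token].add(rid)          # one block per token (a simple blocking key)

candidate_pairs = set()
for ids in blocks.values():
    candidate_pairs.update(combinations(sorted(ids), 2))

all_pairs = len(records) * (len(records) - 1) // 2
print(f"compared {len(candidate_pairs)} of {all_pairs} possible pairs")
# BLAST2 goes further: it extracts loose schema information in an unsupervised way and
# uses it to weight and prune blocks, so comparisons concentrate on likely matches.
```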
Citations: 4
RuleHub
Pub Date : 2020-10-15 DOI: 10.1145/3409384
N. Ahmadi, Thi-Thuy-Duyen Truong, Le-Hong-Mai Dao, Stefano Ortona, Paolo Papotti
Entity-centric knowledge graphs (KGs) are now popular to collect facts about entities. KGs have rich schemas with a large number of different types and predicates to describe the entities and their relationships. On these rich schemas, logical rules are used to represent dependencies between the data elements. While rules are useful in query answering, data curation, and other tasks, they usually do not come with the KGs. Such rules have to be manually defined or discovered with the help of rule mining methods. We believe this rule-collection task should be done collectively to better capitalize on our understanding of the data and to avoid redundant work conducted on the same KGs. For this reason, we introduce RuleHub, our extensible corpus of rules for public KGs. RuleHub provides functionalities for the archival and the retrieval of rules to all users, with an extensible architecture that does not constrain the KG or the type of rules supported. We are populating the corpus with thousands of rules from the most popular KGs and report on our experiments on automatically characterizing the quality of a rule with statistical measures.
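The statistical measures mentioned above can be illustrated with a toy computation of support and confidence for a single rule over a handful of triples. The rule, predicate names, and facts are invented for this sketch and are not drawn from RuleHub.

```python
# A minimal sketch of rule quality measures over a tiny KG: for the invented rule
# speaks(X, L) <- bornIn(X, C), officialLanguage(C, L), support counts the rule's
# predictions that appear in the KG, and confidence divides by all predictions.
triples = {
    ("ada", "bornIn", "italy"), ("bob", "bornIn", "italy"),
    ("italy", "officialLanguage", "italian"),
    ("ada", "speaks", "italian"),              # bob's language is missing from the KG
}

def rule_quality(triples):
    born = [(s, o) for s, p, o in triples if p == "bornIn"]
    lang = {s: o for s, p, o in triples if p == "officialLanguage"}
    predictions = {(person, lang[country]) for person, country in born if country in lang}
    support = sum((p, "speaks", l) in triples for p, l in predictions)
    confidence = support / len(predictions) if predictions else 0.0
    return support, confidence

support, confidence = rule_quality(triples)
print(f"support={support}, confidence={confidence:.2f}")   # support=1, confidence=0.50
```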
Citations: 2
Incremental Discovery of Imprecise Functional Dependencies
Pub Date : 2020-10-15 DOI: 10.1145/3397462
Loredana Caruccio, Stefano Cirillo
Functional dependencies (fds) are one of the metadata used to assess data quality and to perform data cleaning operations. However, to pursue robustness with respect to data errors, it has been necessary to devise imprecise versions of functional dependencies, yielding relaxed functional dependencies (rfds). Among them, there exists the class of rfds relaxing on the extent, i.e., those admitting the possibility that an fd holds on a subset of data. In the literature, several algorithms to automatically discover rfds from big data collections have been defined. They achieve good performance with respect to the inherent complexity of the problem. However, most of them are capable of discovering rfds only by batch processing the entire dataset. This is not suitable in the era of big data, where the size of a database instance can grow at high velocity, and the insertion of new data can invalidate previously holding rfds. Thus, it is necessary to devise incremental discovery algorithms capable of updating the set of holding rfds upon data insertions, without processing the entire dataset. To this end, in this article we propose an incremental discovery algorithm for rfds relaxing on the extent. It manages the validation of candidate rfds and the generation of possibly new rfd candidates upon the insertion of the new tuples, while limiting the size of the overall search space. Experimental results show that the proposed algorithm achieves extremely good performance on real-world datasets.
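A sketch of the underlying idea follows: a functional dependency relaxed on the extent is accepted when it holds on at least a given fraction of the tuples, and an insertion only needs to update the counters of the affected left-hand-side group, with the check then re-aggregating the group maxima rather than re-scanning the relation. Column names, data, and the threshold are illustrative assumptions, not the article's algorithm.

```python
# A minimal sketch of validating an extent-relaxed FD (here ZIP -> CITY): for each ZIP
# the most frequent CITY is kept and the remaining tuples count as violations.
from collections import Counter, defaultdict

class ExtentRelaxedFD:
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.groups = defaultdict(Counter)   # LHS value -> Counter of RHS values
        self.total = 0

    def insert(self, lhs, rhs):
        # Only the counters of the new tuple's LHS group are touched on insert.
        self.groups[lhs][rhs] += 1
        self.total += 1

    def holds(self):
        if self.total == 0:
            return True
        kept = sum(max(c.values()) for c in self.groups.values())
        return kept / self.total >= self.threshold

fd = ExtentRelaxedFD(threshold=0.9)
for zip_code, city in [("10101", "Rome"), ("10101", "Rome"), ("20202", "Milan")]:
    fd.insert(zip_code, city)
print(fd.holds())                     # True: the FD holds on all 3 tuples
fd.insert("10101", "Roma")            # a new tuple contradicting the majority city
print(fd.holds())                     # False: only 3 of 4 tuples fit (0.75 < 0.9)
```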
Citations: 23