With the advancement of artificial intelligence technology, ChatGPT, a new practice of artificial intelligence, holds immense potential across multiple fields. Its user-friendly human-machine interface, rapid response capabilities, and delivery of high-quality answers have attracted considerable attention and widespread usage. Regarded by many as a groundbreaking advancement in AI, ChatGPT represents a new milestone in the field. However, as with any technological evolution, the emergence of ChatGPT brings not only benefits but also inevitable security risks and ethical issues. This paper provides specific information about ChatGPT, including its technology, limitations, ethical issues, governance paths, and future directions. Specifically, we first offer a thorough exploration of the technical implementation details of the GPT series models. Next, we provide an analysis elucidating the reasons for ChatGPT's limitations and scrutinize the consequential impacts, such as malicious misuse and privacy violation. Finally, we explore diverse governance paths to mitigate the impacts of ChatGPT and present future directions. This review aims to equip users with crucial knowledge, facilitating well-informed decision-making, effective handling of potential challenges in employing ChatGPT, and staying abreast of the rapidly evolving landscape of this technology.
{"title":"The Limitations and Ethical Considerations of ChatGPT","authors":"Shangying Hua, Shuangci Jin, Shengyi Jiang","doi":"10.1162/dint_a_00243","DOIUrl":"https://doi.org/10.1162/dint_a_00243","url":null,"abstract":"\u0000 With the advancements of artificial intelligence technology, ChatGPT, a new practice of artificial intelligence, holds immense potential across multiple fields. Its user-friendly human-machine interface, rapid response capabilities, and delivery of high-quality answers have attracted considerable attention and widespread usage. Regarded by many as a groundbreaking advancement in AI, ChatGPT represents a new milestone in the field. However, as with any technological evolution, the emergence of ChatGPT brings not only benefits, but also inevitable security risks and ethical issues. This paper provides specific information about ChatGPT, including its technology, limitations, ethical issues, governance paths and future directions. Specifically, we firstly offered a thorough exploration of the technical implementation details of GPT series models. Next, we provided an intricate analysis elucidating the reasons for limitations and scrutinized the consequential impacts, such as malicious misuse, privacy violation, and so on. Finally, we explore diverse governance paths to mitigate the impacts of ChatGPT and present future directions. This review aims to equip users with crucial knowledge, facilitating well-informed decision-making, effectively handling of potential challenges in employing ChatGPT, and staying abreast with the rapidly evolving landscape of this technology.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"45 12","pages":""},"PeriodicalIF":3.9,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138949820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A detailed acquisition, analysis, and representation of biological systems exhibiting different functions is required to develop unique bio-inspired multifunctional conceptual designs and methods. This paper presents BIKAS: Bio-inspired Knowledge Acquisition and Simulacrum, a knowledge database of biological systems exhibiting various functionalities, developed from case-based bio-inspired examples in the literature. The knowledge database represents each biological feature, its characteristics, and the function it exhibits as a combination of its integrated structure and structural strategy. Furthermore, this knowledge database is utilized by the Expandable Domain Integrated Design (xDID) model, which classifies, maps, and represents biological features into their respective geometric designations, called Domains. Combining features from the Domains generates multifunctional conceptual designs. In addition, meta-level design factors are proposed to help designers filter biological features and their respective functions that share a structural strategy, thus enabling rapid selection and emulation of biological functions.
{"title":"BIKAS: Bio-Inspired Knowledge Acquisition and Simulacrum—A Knowledge Database to Support Multifunctional Design Concept Generation","authors":"Pavan Tejaswi Velivela, Yaoyao Fiona Zhao","doi":"10.1162/dint_a_00240","DOIUrl":"https://doi.org/10.1162/dint_a_00240","url":null,"abstract":"A detailed acquisition, analysis, and representation of biological systems exhibiting different functions is required to develop unique bio-inspired multifunctional conceptual designs and methods. This paper presents BIKAS: Bio-inspired Knowledge Acquisition and Simulacrum, a knowledge database of biological systems exhibiting various functionalities, developed based on case-based bio-inspired examples from literature. The knowledge database represents the biological features, their characteristics, and the function exhibited by the biological feature as a combination of its integrated structure and structural strategy. Furthermore, this knowledge database is utilized by the Expandable Domain Integrated Design (xDID) model that works on classifying, mapping, and representing biological features into their respective geometric designations called Domains. The combination of features from the Domains results in the generation of multifunctional conceptual designs. In addition, Meta-level design factors are proposed to aid designers in filtering the biological features and their respective functions having a similar structural strategy, thus aiding designers in rapidly selecting and emulating biological functions.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"257 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139173222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-resource text plagiarism detection faces a significant challenge due to the limited availability of labeled data for training. This task requires the development of sophisticated algorithms capable of identifying similarities and differences in texts, particularly in the realm of semantic rewriting and translation-based plagiarism detection. In this paper, we present an enhanced attentive Siamese Long Short-Term Memory (LSTM) network designed for Tibetan-Chinese plagiarism detection. Our approach begins with the introduction of translation-based data augmentation, aimed at expanding the bilingual training dataset. Subsequently, we propose a pre-detection method leveraging abstract document vectors to enhance detection efficiency. Finally, we introduce an improved attentive Siamese LSTM network tailored for Tibetan-Chinese plagiarism detection. We conduct comprehensive experiments to showcase the effectiveness of our proposed plagiarism detection framework.
{"title":"Exploring Attentive Siamese LSTM for Low-Resource Text Plagiarism Detection","authors":"Wei Bao, Jian Dong, Yang Xu, Yuanyuan Yang, Xiaoke Qi","doi":"10.1162/dint_a_00242","DOIUrl":"https://doi.org/10.1162/dint_a_00242","url":null,"abstract":"Low-resource text plagiarism detection faces a significant challenge due to the limited availability of labeled data for training. This task requires the development of sophisticated algorithms capable of identifying similarities and differences in texts, particularly in the realm of semantic rewriting and translation-based plagiarism detection. In this paper, we present an enhanced attentive Siamese Long Short-Term Memory (LSTM) network designed for Tibetan-Chinese plagiarism detection. Our approach begins with the introduction of translation-based data augmentation, aimed at expanding the bilingual training dataset. Subsequently, we propose a pre-detection method leveraging abstract document vectors to enhance detection efficiency. Finally, we introduce an improved attentive Siamese LSTM network tailored for Tibetan-Chinese plagiarism detection. We conduct comprehensive experiments to showcase the effectiveness of our proposed plagiarism detection framework.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"49 ","pages":""},"PeriodicalIF":3.9,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139174828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rule mining has emerged as a crucial technique in data mining and knowledge discovery, enabling the extraction of valuable insights and patterns from vast datasets. This has garnered significant attention from both academic and industrial communities. However, there is a lack of bibliometric and visualization research on rule mining, leading to an unclear delineation of research topics and trends in the field. To fill this gap, this paper provides a comprehensive and up-to-date bibliometric analysis of rule mining, covering 4524 publications published between 1987 and 2022. Using various metrics and visualization techniques, we examine the patterns, trends, and evolution of rule mining. The results show a sustained growth in rule mining research, with a significant increase in publication output in recent years, and its rapid expansion into new areas such as explainable artificial intelligence and privacy protection. While the majority of publications come from Asia, the National Natural Science Foundation of China emerges as the top funding agency in the field. We also identify highly productive authors and significant members of co-authorship networks, as well as the most influential publications and citation bursts. The need for international collaboration and the integration of diverse research perspectives is highlighted. Despite the progress in rule mining, several challenges still require further research, including scalability and efficiency, explainability, network security and privacy protection, and personalized and user-centered design. Overall, this paper provides a valuable roadmap for researchers, policymakers, and practitioners interested in rule-mining research.
{"title":"Rule Mining Trends from 1987 to 2022: A Bibliometric Analysis and Visualization","authors":"Shiqi Zhou, Sheng Bi, Guilin Qi","doi":"10.1162/dint_a_00239","DOIUrl":"https://doi.org/10.1162/dint_a_00239","url":null,"abstract":"\u0000 Rule mining has emerged as a crucial technique in data mining and knowledge discovery, enabling the extraction of valuable insights and patterns from vast datasets. This has garnered significant attention from both academic and industrial communities. However, there is a lack of bibliometric and visualization research on rule mining, leading to an unclear delineation of research topics and trends in the field. To fill this gap, this paper provides a comprehensive and up-to-date bibliometric analysis of rule mining, covering 4524 publications published between 1987 and 2022. Using various metrics and visualization techniques, we examine the patterns, trends, and evolution of rule mining. The results show a sustained growth in rule mining research, with a significant increase in publication output in recent years, and its rapid expansion into new areas such as explainable artificial intelligence and privacy protection. While the majority of publications come from Asia, the National Natural Science Foundation of China emerges as the top funding agency in the field. We also identify highly productive authors and significant members of co-authorship networks, as well as the most influential publications and citation bursts. The need for international collaboration and the integration of diverse research perspectives is highlighted. Despite the progress in rule mining, several challenges still require further research, including scalability and efficiency, explainability, network security and privacy protection, and personalized and user-centered design. Overall, this paper provides a valuable roadmap for researchers, policymakers, and practitioners interested in rule-mining research.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":" 12","pages":""},"PeriodicalIF":3.9,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138963544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Timestamps play a key role in process mining because they determine the chronology of events and, consequently, how those events are ordered in process modelling. Timestamps give insight into process performance, conformance, and modelling, which means that problems with timestamps result in misrepresentations of the mined process. A few articles have been published on the quantification of data quality problems, but at the time of writing only one of them is based on the quantification of timestamp quality problems. This article evaluates the quality of timestamps in event logs across two axes, using eleven quality dimensions and four levels of potential data quality problems. The eleven data quality dimensions were obtained through a thorough review of more than fifty process mining articles that focus on quality dimensions. The evaluation resulted in twelve data quality quantification metrics, which were applied to the MIMIC-III dataset as an illustration. The timestamp quality quantification using the proposed typology enables users to appreciate the quality of an event log and thus makes it possible to evaluate the risk of carrying out specific data cleaning measures to improve the process mining outcome.
{"title":"Classification and quantification of timestamp data quality issues and its impact on data quality outcome","authors":"Rex Ambe","doi":"10.1162/dint_a_00238","DOIUrl":"https://doi.org/10.1162/dint_a_00238","url":null,"abstract":"\u0000 Timestamps play a key role in process mining because it determines the chronology of which events occurred and subsequently how they are ordered in process modelling. The timestamp in process mining gives an insight on process performance, conformance, and modelling. This therefore means problems with the timestamp will result in misrepresentations of the mined process. A few articles have been published on the quantification of data quality problems but just one of the articles at the time of this paper is based on the quantification of timestamp quality problems. This article evaluates the quality of timestamps in event log across two axes using eleven quality dimensions and four levels of potential data quality problems. The eleven data quality dimensions were obtained by doing a thorough literature review of more than fifty process mining articles which focus on quality dimensions. This evaluation resulted in twelve data quality quantification metrics and the metrics were applied to the MIMIC-III dataset as an illustration. The outcome of the timestamp quality quantification using the proposed typology enabled the user to appreciate the quality of the event log and thus makes it possible to evaluate the risk of carrying out specific data cleaning measures to improve the process mining outcome.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"123 2","pages":""},"PeriodicalIF":3.9,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138995057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parameter calibration is an important part of hydrological simulation and affects the final simulation results. In this paper, we introduce a heuristic optimization algorithm, the genetic algorithm (GA), to cope with the complexity of the parameter calibration problem, and use the particle swarm optimization algorithm (PSO) as a comparison. For large-scale hydrological simulations, we use a multilevel parallel parameter calibration framework to make full use of processor resources and accelerate the solution of high-dimensional parameter calibration. We further test and apply the framework on domestic supercomputers. The parameter calibration results with GA and PSO reach the target value of 0.65 and above, with PSO achieving a speedup of 58.52 on the TianHe-2 supercomputer. The experimental results indicate that, using a parallel implementation on multicore CPUs, high-dimensional parameter calibration in large-scale hydrological simulation is feasible. Moreover, our comparison of the two algorithms shows that GA obtains better calibration results, while PSO has a more pronounced acceleration effect.
{"title":"Comparison of Parallel Genetic Algorithm and Particle Swarm Optimization for Parameter Calibration in Hydrological Simulation","authors":"Xinyu Zhang, Yang Li, Genshen Chu","doi":"10.1162/dint_a_00221","DOIUrl":"https://doi.org/10.1162/dint_a_00221","url":null,"abstract":"Parameter calibration is an important part of hydrological simulation and affects the final simulation results. In this paper, we introduce heuristic optimization algorithms, genetic algorithm (GA) to cope with the complexity of the parameter calibration problem, and use particle swarm optimization algorithm (PSO) as a comparison. For large scale hydrological simulations, we use a multilevel parallel parameter calibration framework to make full use of processor resources, accelerate the process of solving high-dimensional parameter calibration. Further, we test and apply the experiments on domestic supercomputers. The results of parameter calibration with GA and PSO can basically reach the ideal value of 0.65 and above, with PSO achieving a speedup of 58.52 on TianHe-2 supercomputer. The experimental results indicate that by using a parallel implementation on multicore CPUs, high-dimensional parameter calibration in large scale hydrological simulation is possible. Moreover, our comparison of the two algorithms shows that the GA obtains better calibration results, and the PSO has a more pronounced acceleration effect.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":" 23","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135192134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Since 2014, “Bring Your Own Data” workshops (BYODs) have been organised to inform people about the process and benefits of making resources Findable, Accessible, Interoperable, and Reusable (FAIR, and the FAIRification process). The BYOD workshops’ content and format differ depending on their goal, context, and the background and needs of participants. Data-focused BYODs educate domain experts on how to make their data FAIR to find new answers to research questions. Management-focused BYODs promote the benefits of making data FAIR and instruct project managers and policy-makers on the characteristics of FAIRification projects. Software-focused BYODs gather software developers and experts on FAIR to implement or improve software resources that are used to support FAIRification. Overall, these BYODs intend to foster collaboration between different types of stakeholders involved in data management, curation, and reuse (e.g. domain experts, trainers, developers, data owners, data analysts, FAIR experts). The BYODs also serve as an opportunity to learn what kind of support for FAIRification is needed from different communities and to develop teaching materials based on practical examples and experience. In this paper, we detail the three different structures of the BYODs and describe examples of early BYODs related to plant breeding data, and rare disease registries and biobanks, which have shaped the structure of the workshops. We discuss the latest insights into making BYODs more productive by leveraging our almost ten years of training experience in these workshops, including successes and encountered challenges. Finally, we examine how the participants’ feedback has motivated the research on FAIR, including the development of workflows and software.
{"title":"Building expertise on FAIR through evolving Bring Your Own Data (BYOD) workshops: describing the data, software, and management- focused approaches and their evolution","authors":"César H. Bernabé, Lieze Thielemans, Rajaram Kaliyaperumal, Claudio Carta, Shuxin Zhang, Celia W.G. van Gelder, Nirupama Benis, Luiz Olavo Bonino da Silva Santos, Ronald Cornet, Bruna dos Santos Vieira, Nawel Lalout, Ines Henriques, Alberto Cámara Ballesteros, Kees Burger, Martijn G. Kersloot, Friederike Ehrhart, Esther van Enckevort, Chris T. Evelo, Alasdair J. G. Gray, Marc Hanauer, Kristina Hettne, Joep de Ligt, Arnaldo Pereira, Núria Queralt-Rosinach, Erik Schultes, Domenica Taruscio, Andra Waagmeester, Mark D. Wilkinson, Egon L. Willighagen, Mascha Jansen, Barend Mons, Marco Roos, Annika Jacobsen","doi":"10.1162/dint_a_00236","DOIUrl":"https://doi.org/10.1162/dint_a_00236","url":null,"abstract":"Abstract Since 2014, “Bring Your Own Data” workshops (BYODs) have been organised to inform people about the process and benefits of making resources Findable, Accessible, Interoperable, and Reusable (FAIR, and the FAIRification process). The BYOD workshops’ content and format differ depending on their goal, context, and the background and needs of participants. Data-focused BYODs educate domain experts on how to make their data FAIR to find new answers to research questions. Management-focused BYODs promote the benefits of making data FAIR and instruct project managers and policy-makers on the characteristics of FAIRification projects. Software-focused BYODs gather software developers and experts on FAIR to implement or improve software resources that are used to support FAIRification. Overall, these BYODs intend to foster collaboration between different types of stakeholders involved in data management, curation, and reuse (e.g. domain experts, trainers, developers, data owners, data analysts, FAIR experts). The BYODs also serve as an opportunity to learn what kind of support for FAIRification is needed from different communities and to develop teaching materials based on practical examples and experience. In this paper, we detail the three different structures of the BYODs and describe examples of early BYODs related to plant breeding data, and rare disease registries and biobanks, which have shaped the structure of the workshops. We discuss the latest insights into making BYODs more productive by leveraging our almost ten years of training experience in these workshops, including successes and encountered challenges. Finally, we examine how the participants’ feedback has motivated the research on FAIR, including the development of workflows and software.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"50 21","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135432718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper investigates the capabilities of ChatGPT as an automated assistant in diverse domains, including scientific writing, mathematics, education, programming, and healthcare. We explore the potential of ChatGPT to enhance productivity, streamline problem-solving processes, and improve writing style. Furthermore, we highlight the potential risks associated with excessive reliance on ChatGPT in these fields. These limitations encompass factors like incorrect and fictitious responses, inaccuracies in code, limited logical reasoning abilities, overconfidence, and critical ethical concerns of copyright and privacy violation. We outline areas and objectives where ChatGPT proves beneficial, applications where it should be used judiciously, and scenarios where its reliability may be limited. In light of observed limitations, and given that the tool's fundamental errors may pose a special challenge for non-experts, ChatGPT should be used with a strategic methodology. By drawing from comprehensive experimental studies, we offer methods and flowcharts for effectively using ChatGPT. Our recommendations emphasize iterative interaction with ChatGPT and independent verification of its outputs. Considering the importance of utilizing ChatGPT judiciously and with expertise, we recommend its usage for experts who are well-versed in the respective domains.
{"title":"ChatGPT is a Remarkable Tool—For Experts","authors":"Amos Azaria, Rina Azoulay, Shulamit Reches","doi":"10.1162/dint_a_00235","DOIUrl":"https://doi.org/10.1162/dint_a_00235","url":null,"abstract":"Abstract This paper investigates the capabilities of ChatGPT as an automated assistant in diverse domains, including scientific writing, mathematics, education, programming, and healthcare. We explore the potential of ChatGPT to enhance productivity, streamline problem-solving processes, and improve writing style. Furthermore, we highlight the potential risks associated with excessive reliance on ChatGPT in these fields. These limitations encompass factors like incorrect and fictitious responses, inaccuracies in code, limited logical reasoning abilities, overconfidence, and critical ethical concerns of copyright and privacy violation. We outline areas and objectives where ChatGPT proves beneficial, applications where it should be used judiciously, and scenarios where its reliability may be limited. In light of observed limitations, and given that the tool's fundamental errors may pose a special challenge for non-experts, ChatGPT should be used with a strategic methodology. By drawing from comprehensive experimental studies, we offer methods and flowcharts for effectively using ChatGPT. Our recommendations emphasize iterative interaction with ChatGPT and independent verification of its outputs. Considering the importance of utilizing ChatGPT judiciously and with expertise, we recommend its usage for experts who are well-versed in the respective domains.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"50 23","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135432717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Achieving machine common sense has been a longstanding problem within Artificial Intelligence. Thus far, benchmark data sets that are grounded in a theory of common sense and can be used to conduct rigorous, semantic evaluations of common sense reasoning (CSR) systems have been lacking. One expectation of the AI community is that neuro-symbolic reasoners can help bridge this gap towards more dependable systems with common sense. We propose a novel benchmark, called Theoretically Grounded common sense Reasoning (TG-CSR), modeled as a set of question answering instances, with each instance grounded in a semantic category of common sense, such as space, time, and emotions. The benchmark is few-shot, i.e., only a few training and validation examples are provided in the public release to avoid the possibility of overfitting. Results from recent evaluations suggest that TG-CSR is challenging even for state-of-the-art statistical models. Due to its semantic rigor, this benchmark can be used to evaluate the common sense reasoning capabilities of neuro-symbolic systems.
{"title":"A Theoretically Grounded Question Answering Data Set for Evaluating Machine Common Sense","authors":"Henrique Santos, Ke Shen, Alice M. Mulvehill, Mayank Kejriwal, Deborah L. McGuinness","doi":"10.1162/dint_a_00234","DOIUrl":"https://doi.org/10.1162/dint_a_00234","url":null,"abstract":"ABSTRACT Achieving machine common sense has been a longstanding problem within Artificial Intelligence. Thus far, benchmark data sets that are grounded in a theory of common sense and can be used to conduct rigorous, semantic evaluations of common sense reasoning (CSR) systems have been lacking. One expectation of the AI community is that neuro-symbolic reasoners can help bridge this gap towards more dependable systems with common sense. We propose a novel benchmark, called Theoretically Grounded common sense Reasoning (TG-CSR), modeled as a set of question answering instances, with each instance grounded in a semantic category of common sense, such as space, time, and emotions. The benchmark is few-shot i.e., only a few training and validation examples are provided in the public release to avoid the possibility of overfitting. Results from recent evaluations suggest that TG-CSR is challenging even for state-of-the-art statistical models. Due to its semantic rigor, this benchmark can be used to evaluate the common sense reasoning capabilities of neuro-symbolic systems.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"14 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135431467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Open Relation Extraction (ORE) is a task of extracting semantic relations from a text document. Current ORE systems have significantly improved their efficiency in obtaining Chinese relations, when compared with conventional systems which heavily depend on feature engineering or syntactic parsing. However, the ORE systems do not use robust neural networks such as pre-trained language models to take advantage of large-scale unstructured data effectively. In response to this issue, a new system entitled Chinese Open Relation Extraction with Knowledge Enhancement (CORE-KE) is presented in this paper. The CORE-KE system employs a pre-trained language model (with the support of a Bidirectional Long Short-Term Memory (BiLSTM) layer and a Masked Conditional Random Field (Masked CRF) layer) on unstructured data in order to improve Chinese open relation extraction. Entity descriptions in Wikidata and additional knowledge (in terms of triple facts) extracted from Chinese ORE datasets are used to fine-tune the pre-trained language model. In addition, syntactic features are further adopted in the training stage of the CORE-KE system for knowledge enhancement. Experimental results of the CORE-KE system on two large-scale datasets of open Chinese entities and relations demonstrate that the CORE-KE system is superior to other ORE systems. The F1-scores of the CORE-KE system on the two datasets show relative improvements of 20.1% and 1.3%, respectively, compared with benchmark ORE systems. The source code is available at https://github.com/cjwen15/CORE-KE.
{"title":"Improving Extraction of Chinese Open Relations Using Pre-trained Language Model and Knowledge Enhancement","authors":"Chaojie Wen, Xudong Jia, Tao Chen","doi":"10.1162/dint_a_00227","DOIUrl":"https://doi.org/10.1162/dint_a_00227","url":null,"abstract":"Abstract Open Relation Extraction (ORE) is a task of extracting semantic relations from a text document. Current ORE systems have significantly improved their efficiency in obtaining Chinese relations, when compared with conventional systems which heavily depend on feature engineering or syntactic parsing. However, the ORE systems do not use robust neural networks such as pre-trained language models to take advantage of large-scale unstructured data effectively. In respons to this issue, a new system entitled Chinese Open Relation Extraction with Knowledge Enhancement (CORE-KE) is presented in this paper. The CORE-KE system employs a pre-trained language model (with the support of a Bidirectional Long Short-Term Memory (BiLSTM) layer and a Masked Conditional Random Field (Masked CRF) layer) on unstructured data in order to improve Chinese open relation extraction. Entity descriptions in Wikidata and additional knowledge (in terms of triple facts) extracted from Chinese ORE datasets are used to fine-tune the pre-trained language model. In addition, syntactic features are further adopted in the training stage of the CORE-KE system for knowledge enhancement. Experimental results of the CORE-KE system on two large-scale datasets of open Chinese entities and relations demonstrate that the CORE-KE system is superior to other ORE systems. The F1-scores of the CORE-KE system on the two datasets have given a relative improvement of 20.1% and 1.3%, when compared with benchmark ORE systems, respectively. The source code is available at https://github.com/cjwen15/CORE-KE.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"50 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135432728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}