
Information Processing & Management: Latest Publications

FairColor: An efficient algorithm for the Balanced and Fair Reviewer Assignment Problem
IF 7.4 Zone 1 (Management) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-08-22 DOI: 10.1016/j.ipm.2024.103865
Khadra Bouanane , Abdeldjaouad Nusayr Medakene , Abdellah Benbelghit , Samir Brahim Belhaouari

As the volume of submitted papers continues to rise, ensuring a fair and accurate assignment of manuscripts to reviewers has become increasingly critical for academic conference organizers. Given the paper-reviewer similarity scores, this study introduces the Balanced and Fair Reviewer Assignment Problem (BFRAP), which aims to maximize the overall similarity score (efficiency) and the minimum paper score (fairness) subject to coverage, load balance, and fairness constraints. Addressing the challenges posed by these constraints, we conduct a theoretical investigation into the threshold conditions for the problem's feasibility and optimality. To facilitate this investigation, we establish a connection between BFRAP, defined over m reviewers, and the Equitable m-Coloring Problem. Building on this theoretical foundation, we propose FairColor, an algorithm designed to retrieve fair and efficient assignments. We compare FairColor to Fairflow and FairIR, two state-of-the-art algorithms designed to find fair assignments under similar constraints. Empirical experiments were conducted on four real and two synthetic datasets, with (paper, reviewer) counts ranging from (100, 100) to (10124, 5880). Results demonstrate that FairColor finds efficient and fair assignments quickly compared to Fairflow and FairIR. Notably, in the largest instance, involving 10,124 manuscripts and 5680 reviewers, FairColor retrieves fair and efficient assignments in just 67.64 s. This contrasts starkly with both other methods, which require significantly longer computation times (45 min for Fairflow and 3 h 24 min for FairIR), even on more powerful machines. These results underscore FairColor as a promising alternative to current state-of-the-art assignment techniques.
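The BFRAP setting described above can be made concrete with a small sketch. The following is an illustrative greedy heuristic under assumed coverage and load-cap constraints, not the FairColor algorithm itself; the function name and parameters are hypothetical:

```python
import numpy as np

def greedy_balanced_assignment(sim, reviewers_per_paper, max_load):
    """Toy sketch of the BFRAP setting (not the FairColor algorithm):
    cover every paper with `reviewers_per_paper` reviewers under a
    per-reviewer load cap, then report efficiency (total similarity)
    and fairness (minimum per-paper score)."""
    n_papers, n_reviewers = sim.shape
    load = np.zeros(n_reviewers, dtype=int)
    assignment = [[] for _ in range(n_papers)]
    # Greedy: visit papers by their best attainable similarity, then pick
    # the highest-similarity reviewers that still have spare capacity.
    for p in np.argsort(-sim.max(axis=1)):
        for r in np.argsort(-sim[p]):
            if len(assignment[p]) == reviewers_per_paper:
                break
            if load[r] < max_load:
                assignment[p].append(int(r))
                load[r] += 1
        if len(assignment[p]) < reviewers_per_paper:
            raise ValueError("infeasible under this load cap")
    paper_scores = np.array([sim[p, assignment[p]].sum() for p in range(n_papers)])
    return assignment, float(paper_scores.sum()), float(paper_scores.min())
```

A greedy pass like this maximizes neither objective in general; FairColor's contribution is precisely to trade efficiency against the fairness floor while staying fast.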

Citations: 0
An adaptive approach to noisy annotations in scientific information extraction
IF 7.4 Zone 1 (Management) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-08-12 DOI: 10.1016/j.ipm.2024.103857
Necva Bölücü, Maciej Rybinski, Xiang Dai, Stephen Wan

Despite recent advances in large language models (LLMs), the best effectiveness in information extraction (IE) is still achieved by fine-tuned models, hence the need for manually annotated datasets to train them. However, collecting human annotations for IE, especially for scientific IE, where expert annotators are often required, is expensive and time-consuming. Another issue widely discussed in the IE community is noisy annotations. Mislabelled training samples can hamper the effectiveness of trained models. In this paper, we propose a solution to alleviate problems originating from the high cost and difficulty of the annotation process. Our method distinguishes clean training samples from noisy samples and then employs weighted weakly supervised learning (WWSL) to leverage noisy annotations. Evaluation of Named Entity Recognition (NER) and Relation Classification (RC) tasks in Scientific IE demonstrates the substantial impact of detecting clean samples. Experimental results highlight that our method, utilising clean and noisy samples with WWSL, outperforms the baseline RoBERTa on NER (+4.28, +4.59, +29.27, and +5.21 gain for the ADE, SciERC, STEM-ECR, and WLPC datasets, respectively) and the RC (+6.09 and +4.39 gain for the SciERC and WLPC datasets, respectively) tasks. Comprehensive analyses of our method reveal its advantages over state-of-the-art denoising baseline models in scientific NER. Moreover, the framework is general enough to be adapted to different NLP tasks or domains, which means it could be useful in the broader NLP community.
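The down-weighting idea behind WWSL can be illustrated with a minimal sketch: noisy samples stay in training but contribute less to the loss. The weighting scheme and the `noisy_weight` hyperparameter here are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def wwsl_loss(logits, labels, clean_mask, noisy_weight=0.3):
    """Weighted cross-entropy sketch: samples flagged as clean get
    weight 1.0, samples flagged as noisy get a reduced weight
    (`noisy_weight` is a hypothetical hyperparameter)."""
    # numerically stable softmax
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # per-sample negative log-likelihood of the (possibly noisy) label
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    weights = np.where(clean_mask, 1.0, noisy_weight)
    return float((weights * nll).sum() / weights.sum())
```

Compared with simply discarding noisy samples, this keeps their weak supervisory signal while limiting the damage a mislabelled example can do.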

Citations: 0
Robust and resource-efficient table-based fact verification through multi-aspect adversarial contrastive learning
IF 7.4 Zone 1 (Management) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-08-12 DOI: 10.1016/j.ipm.2024.103853
Ruiheng Liu , Yu Zhang , Bailong Yang , Qi Shi , Luogeng Tian

Table-based fact verification focuses on determining the truthfulness of statements by cross-referencing data in tables. This task is challenging due to the complex interactions inherent in table structures. To address this challenge, existing methods have devised a range of specialized models. Although these models demonstrate good performance, they still exhibit limited sensitivity to essential variations in information relevant to reasoning within both statements and tables, and thus learn spurious patterns that lead to potentially unreliable outcomes. In this work, we propose a novel approach named Multi-Aspect Adversarial Contrastive Learning (Macol), aimed at enhancing the accuracy and robustness of table-based fact verification systems while remaining resource-efficient. Specifically, we first extract pivotal logical reasoning clues to construct positive and adversarial negative instances for contrastive learning. We then propose a new training paradigm that introduces a contrastive learning objective, encouraging the model to recognize noise invariance between original and positive instances while distinguishing logical differences between original and negative instances. Extensive experimental results on three widely studied datasets (TABFACT, INFOTABS, and SEM-TAB-FACTS) demonstrate that Macol achieves state-of-the-art performance on benchmarks across various backbone architectures, with accuracy improvements reaching up to 5.4%. Furthermore, Macol exhibits significant advantages in robustness and low-resource scenarios, with improvements of up to 8.2% and 9.1%, respectively. It is worth noting that our method achieves comparable or even superior performance while being more resource-efficient than approaches that employ additional pre-training or utilize large language models (LLMs).
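The contrastive objective described above can be sketched generically. This is a standard InfoNCE-style loss over one anchor (the original statement representation), its noise-perturbed positive, and adversarial negatives with flipped logic; Macol's exact formulation may differ:

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style sketch: pull the anchor toward the positive and
    push it away from the adversarial negatives. Vectors stand in for
    learned statement/table representations (illustrative only)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(cos(anchor, positive) / temperature)
    negs = sum(np.exp(cos(anchor, n) / temperature) for n in negatives)
    # loss is low when the anchor is much closer to the positive
    return float(-np.log(pos / (pos + negs)))
```

Minimizing this term over batches is what encourages noise invariance (anchor ≈ positive) while sharpening logical distinctions (anchor far from negatives).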

Citations: 0
Dynamic Feature Focusing Network for small object detection
IF 7.4 Zone 1 (Management) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-08-12 DOI: 10.1016/j.ipm.2024.103858
Rudong Jing , Wei Zhang , Yuzhuo Li , Wenlin Li , Yanyan Liu

Deep learning has driven research in object detection and achieved impressive results. Despite these significant advancements, small object detection still struggles with low recognition rates and inaccurate positioning, largely because of the objects' miniature size. The location deviation of small objects induces severe feature misalignment, and the disequilibrium between classification and regression tasks hinders accurate recognition. To address these issues, we propose a Dynamic Feature Focusing Network (DFFN), which comprises two key modules: the Visual Perception Enhancement Module (VPEM) and the Task Association Module (TAM). Drawing upon deformable convolution and the attention mechanism, the VPEM concentrates on sparse key features and perceives the misalignment via positional offsets. We aggregate multi-level features at identical spatial locations via a layer-average operation to learn a more discriminative representation. Incorporating class alignment and bounding box alignment parts, the TAM promotes classification ability, refines bounding box regression, and facilitates the joint learning of classification and localization. We conduct diverse experiments, and the proposed method considerably enhances small object detection performance on four benchmark datasets: MS COCO, VisDrone, VOC, and TinyPerson. Our method improves mAP and APs by 3.4 and 2.2 points, respectively, a solid gain on COCO. Compared to other classic detection models, DFFN exhibits a high level of competitiveness in precision.
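The layer-average aggregation mentioned above can be sketched as follows, assuming the feature pyramid is given finest-level first in (C, H, W) layout and using nearest-neighbour upsampling for simplicity:

```python
import numpy as np

def layer_average(features):
    """Sketch of layer-average aggregation: upsample coarser pyramid
    levels to the finest resolution and average the levels at
    identical spatial locations."""
    _, target_h, target_w = features[0].shape  # finest level defines the grid
    total = np.zeros(features[0].shape, dtype=float)
    for f in features:
        _, h, w = f.shape
        # nearest-neighbour upsampling by integer factors
        total += f.repeat(target_h // h, axis=1).repeat(target_w // w, axis=2)
    return total / len(features)
```

Averaging across levels at the same spatial location mixes fine detail with coarse context, which is what makes the aggregated map more discriminative for small objects.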

Citations: 0
OBCTeacher: Resisting labeled data scarcity in oracle bone character detection by semi-supervised learning
IF 7.4 Zone 1 (Management) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-08-07 DOI: 10.1016/j.ipm.2024.103864
Xiuan Wan , Zhengchen Li , Dandan Liang , Shouyong Pan , Yuchun Fang

Oracle bone characters (OBCs) are ancient ideographs used for divination and record-keeping, as well as first-hand evidence of ancient Chinese culture. Detecting OBCs is a prerequisite for advanced studies and was mainly done by authoritative experts in the past. Deep learning techniques have great potential to facilitate OBC detection, but the high annotation cost of OBCs leads to a scarcity of labeled data, hindering their application. This paper proposes a novel OBC detection framework called OBCTeacher, based on semi-supervised learning (SSL), to resist labeled data scarcity. We first construct a large-scale OBC detection dataset. Through investigation, we find that spatial mismatching and class imbalance problems lead to fewer positive anchors and biased predictions, affecting the quality of pseudo labels and the performance of OBC detection. To mitigate the spatial mismatching problem, we introduce a geometric-prior-based anchor assignment strategy and a heatmap polishing procedure to increase positive anchors and improve the quality of pseudo labels. For the class imbalance problem, we propose a re-weighting method based on estimated class information and a contrastive anchor loss to achieve prioritized learning on different OBC classes and better class boundaries. We evaluate our method using only a small portion of labeled data (treating the remainder as unlabeled) and using all labeled data with extra unlabeled data. The results demonstrate the effectiveness of our method compared with other state-of-the-art methods, with superior performance and significant improvements averaging 11.97 in AP50:95 over the supervised-only baseline. In addition, our method achieves performance comparable to the fully supervised baseline while using only 20% of the labeled data, demonstrating that it significantly reduces the dependence on labeled data for OBC detection.
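The re-weighting step can be illustrated with a common class-balanced scheme ("effective number of samples"); the paper's exact formula based on estimated class information may differ:

```python
import numpy as np

def class_balanced_weights(class_counts, beta=0.999):
    """Sketch of class re-weighting for imbalanced data: rare classes
    receive larger weights. Uses the effective-number scheme as an
    illustrative stand-in for the paper's estimated-class-information
    method; `beta` is a hypothetical hyperparameter."""
    counts = np.asarray(class_counts, dtype=float)
    # effective number of samples per class: (1 - beta^n) / (1 - beta)
    effective = (1.0 - beta ** counts) / (1.0 - beta)
    weights = 1.0 / effective
    # normalize so the average class weight is 1
    return weights / weights.sum() * len(counts)
```

Multiplying each class's loss term by such a weight prioritizes rare OBC classes during training, counteracting the bias toward frequent classes.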

Citations: 0
Get by how much you pay: A novel data pricing scheme for data trading
IF 7.4 Zone 1 (Management) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-08-07 DOI: 10.1016/j.ipm.2024.103849
Yu Lu , Jingyu Wang , Lixin Liu , Hanqing Yang

As a crucial step in promoting data sharing, data trading can stimulate the development of the data economy. However, the current data trading market focuses primarily on satisfying data owners' interests, overlooking the demands of data requesters. Ignoring these demands may lead to a loss of market competitiveness, customer attrition, and missed business opportunities, while damaging reputation and innovation capabilities. Therefore, in this paper, we introduce a novel pricing mechanism named Get By How Much You Pay (GHMP), based on compressed sensing technology and game theory, to address pricing according to data requesters' demands. The scheme employs a dictionary matrix as the sparse basis matrix in compressed sensing. The quality of this matrix directly affects the precision with which the requester can reconstruct the data: if the requester requires higher-precision data, the corresponding payment increases accordingly, thereby realizing demand-based pricing. A game-based pricing method is proposed to settle the final pricing and purchasing between the data requester and the data owner, using an authorized smart contract as an intermediary. As a participant in the game, the smart contract receives a higher transaction fee only if it successfully assists the data requester and data owner in completing the pricing; it therefore strives to establish reasonable prices for both parties during the trading process. The experimental results demonstrate that this game-based approach helps the data requester and owner achieve optimal data pricing, thereby maximizing the interests of both parties.
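The link between payment and reconstruction precision rests on a basic property of compressed sensing: the more measurements the requester purchases, the more accurately a sparse signal can be recovered. A generic sketch using orthogonal matching pursuit (not the paper's mechanism; all names and numbers here are illustrative):

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily recover a k-sparse x from y = A @ x."""
    residual, support = y.astype(float), []
    for _ in range(k):
        # pick the atom most correlated with the current residual
        support.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

# Toy illustration: a requester paying for more measurements (larger m)
# reconstructs the data more precisely.
rng = np.random.default_rng(0)
n, k = 100, 3
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = [1.0, -2.0, 1.5]
errors = {}
for m in (8, 60):  # few vs. many purchased measurements
    A = rng.standard_normal((m, n))
    errors[m] = np.linalg.norm(omp(A, A @ x_true, k) - x_true)
```

Under GHMP, this monotone precision/measurement trade-off is what lets price track the requester's demanded reconstruction quality.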

Citations: 0
The interaction of inter-organizational diversity and team size, and the scientific impact of papers
IF 7.4 Tier 1 (Management) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2024-08-06 DOI: 10.1016/j.ipm.2024.103851
Hyoung Sun Yoo , Ye Lim Jung , June Young Lee , Chul Lee

Large teams are known to be more likely to publish highly cited papers, while small teams are known to be better at publishing highly disruptive papers. However, there is a lack of adequate theoretical understanding of the mechanisms by which scientific collaboration among researchers is related to the scientific impact of their papers. We investigated the mechanisms more closely by focusing on the interaction of inter-organizational diversity and team size in the process of team formation and knowledge dissemination. We analyzed 12,010,102 Web of Science papers and examined how inter-organizational diversity is associated with the relationship of team size with disruption and citations. As a result, we found that not only small teams, but also large teams with great inter-organizational diversity were able to disrupt science and technology effectively. We also found that large teams with greater inter-organizational diversity were more likely to produce highly cited papers. Our findings are robust and consistently observed regardless of publication year, team size, the number of references, and the degree of multidisciplinarity. These results have significant implications for researchers in selecting collaborators to achieve greater impact and for improving the qualitative efficiency of public research investments.

Citations: 0
MOOCs video recommendation using low-rank and sparse matrix factorization with inter-entity relations and intra-entity affinity information
IF 7.4 Tier 1 (Management) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2024-08-06 DOI: 10.1016/j.ipm.2024.103861
Yunmei Gao

Purpose

The severe information overload of MOOC videos reduces students' learning efficiency and the utilization rate of the videos. Two problems deserve attention in matrix factorization (MF)-based video learning resource recommender systems: these methods suffer from the sparsity of the user-item rating matrix, and side information about users and items is seldom used to guide the learning procedure of the MF.

Method

To address these two problems, we propose a new MOOC video resource recommender, LSMFERLI, based on Low-rank and Sparse Matrix Factorization (LSMF) guided by the inter-entity relations and intra-entity latent information of the students and videos. First, we construct the inter-entity relation matrices and the intra-entity latent preference matrix for the students. Second, we construct the inter-entity relation matrices and the intra-entity affinity matrix for the videos. Finally, guided by these inter-entity relation and intra-entity affinity matrices, the student-video rating matrix is factorized into a low-rank matrix and a sparse matrix by an alternating iterative optimization scheme.

Conclusions

Experimental results on the MOOCCube dataset indicate that LSMFERLI outperforms seven state-of-the-art methods on the HR@K and NDCG@K (K = 5, 10, 15) indicators, with average improvements of 20.6 % and 21.0 %, respectively.

Citations: 0
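The low-rank-plus-sparse split described in the Method section above can be sketched by alternating singular-value and entrywise soft-thresholding, in the spirit of robust PCA. This is a minimal illustration: the threshold `lam` and iteration count are arbitrary choices, and the paper's relation and affinity guidance matrices are omitted entirely.

```python
import numpy as np

def lsmf_decompose(R, lam=0.1, n_iter=50):
    """Decompose a rating matrix R into a low-rank part L and a sparse
    part S (R ≈ L + S) by naive alternating soft-thresholding."""
    R = np.asarray(R, dtype=float)
    L = np.zeros_like(R)
    S = np.zeros_like(R)
    for _ in range(n_iter):
        # Low-rank step: singular-value soft-thresholding of R - S.
        U, sigma, Vt = np.linalg.svd(R - S, full_matrices=False)
        sigma = np.maximum(sigma - lam, 0.0)
        L = (U * sigma) @ Vt
        # Sparse step: entrywise soft-thresholding of the residual.
        resid = R - L
        S = np.sign(resid) * np.maximum(np.abs(resid) - lam, 0.0)
    return L, S
```

After the final sparse step, each entry of `R - L - S` is bounded in magnitude by `lam`, so the pair reconstructs the rating matrix up to the shrinkage bias.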
A framework for predicting scientific disruption based on graph signal processing
IF 7.4 Tier 1 (Management) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2024-08-05 DOI: 10.1016/j.ipm.2024.103863
Houqiang Yu, Yian Liang

Identifying scientific disruption is consistently recognized as challenging, and predicting it even more so. We argue that better predictions are hindered by the inability to integrate multidimensional information and by the limited scalability of existing methods. This paper develops a framework based on graph signal processing (GSP) to predict scientific disruption, achieving an average AUC of about 80 % on benchmark datasets and surpassing prior methods by 13.6 % on average. The framework is unified, adaptable to any type of information, and scalable, with the potential for further enhancement using techniques from GSP. The intuition behind the framework is that scientific disruption leads to dramatic changes in scientific evolution; scientific evolution can be viewed as a complex system represented by a graph, and GSP is a technique that specializes in analyzing data on graph structures, so GSP is well suited to modeling scientific evolution and predicting disruption. Based on this framework, we proceed with disruption prediction. Content, context, and (citation) structure information are each defined as graph signals, and the total variations of these graph signals, which measure the amplitude of evolution, serve as the main predictors. To illustrate the unity and scalability of the framework, altmetrics data (online mentions of a paper), seldom considered previously, are also defined as a graph signal, and another indicator, the dispersion entropy of a graph signal (measuring the chaos of scientific evolution), is used for prediction. The framework also offers interpretability for a better understanding of scientific disruption: the analysis indicates that scientific disruption results in dramatic changes not only in knowledge content but also in context (e.g., journals and authors), and leads to chaos in subsequent evolution. Finally, several practical future directions for disruption prediction based on the framework are proposed.

Citations: 0
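The central quantity named in the abstract above, the total variation of a graph signal, has a standard closed form via the graph Laplacian; a minimal sketch follows. The toy graph and signals are assumptions for illustration, not the paper's citation networks.

```python
import numpy as np

def graph_total_variation(adj, signal):
    """Total variation of a signal x on an undirected graph:
    TV(x) = x^T L x = sum over edges (x_i - x_j)^2,
    where L = D - A is the combinatorial graph Laplacian."""
    adj = np.asarray(adj, dtype=float)
    x = np.asarray(signal, dtype=float)
    degree = np.diag(adj.sum(axis=1))
    laplacian = degree - adj
    return float(x @ laplacian @ x)

# Path graph 1-2-3: a constant signal has zero TV (no change across
# any edge), while a spiky signal has large TV.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
```

A signal that changes sharply across edges (a dramatic shift in the evolving system) yields a large total variation, which is exactly the "evolutionary amplitude" reading used as a disruption predictor.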
Evolutions of semantic consistency in research topic via contextualized word embedding
IF 7.4 Tier 1 (Management) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2024-08-03 DOI: 10.1016/j.ipm.2024.103859
Shengzhi Huang , Wei Lu , Qikai Cheng , Zhuoran Luo , Yong Huang

Topic evolution has been studied extensively in the science of science. This study analyzes topic evolution patterns through topics' semantic consistency in the semantic vector space and explores their possible causes. Specifically, we extract papers in the computer science field from the Microsoft Academic Graph as our dataset. We propose a novel method for encoding a topic with numerous Contextualized Word Embeddings (CWE), in which the title and abstract fields of papers studying the topic are taken as its context. We then employ three geometric metrics to analyze topics' semantic consistency over time, excluding the influence of the anisotropy of CWE. The K-Means clustering algorithm identifies four general evolution patterns of semantic consistency: semantic consistency increases (IM), decreases (DM), increases first and then decreases (inverted U-shape), or decreases first and then increases (U-shape). We also find that research methods tend to show DM and U-shape patterns, whereas research questions tend to show IM and inverted U-shape patterns. Finally, we use regression analysis to explore whether and, if so, how a series of key features of a topic affect its semantic consistency. Importantly, the semantic consistency of a topic varies inversely with the semantic similarity between that topic and other topics. Overall, this study sheds light on the laws of topic evolution and helps researchers understand these patterns from a geometric perspective.

Citations: 0
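One plausible geometric reading of the "semantic consistency" metric described above is the mean cosine similarity between each contextualized embedding of a topic and the topic's centroid; the paper uses three geometric metrics, and this sketch is only an assumed stand-in for one of them.

```python
import numpy as np

def semantic_consistency(embeddings):
    """Mean cosine similarity between each contextualized embedding of a
    topic and the topic's normalized centroid. Higher values mean the
    topic's contexts agree semantically (an illustrative metric)."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    c = X.mean(axis=0)
    c = c / np.linalg.norm(c)
    return float(np.mean(X @ c))
```

Tracking this score per publication year gives a trajectory that could then be grouped with K-Means into the IM, DM, U-shape, and inverted-U-shape patterns the study reports.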