INCEPT: A Framework for Duplicate Posts Classification with Combined Text Representations

Journal: ACS Applied Bio Materials (IF 4.7; Q2; Materials Science, Biomaterials)
Pub Date: 2024-07-15
DOI: 10.1145/3677322
Erjon Skenderi, Jukka Huhtamäki, Salla-Maaria Laaksonen, Kostas Stefanidis
Citations: 0

Abstract

Dealing with many of the problems related to the quality of textual content online involves identifying similar content. Algorithmic solutions for duplicate content classification typically rely on text vector representation, which maps textual information into a set of features. Ideally, this representation would capture all aspects of the underlying text, including length, word frequencies, syntax, and semantics. While recent advancements in text representation have led to improved performance, a comprehensive approach that explicitly incorporates all text features has not yet been proposed. In this study, we present the INCEPT framework that utilizes multiple representation methods to detect duplicate text pairs, taking advantage of their individual strengths. The core of our approach involves using a stacking ensemble of pairwise vector distance measurements that are computed from multiple text representation methods. A stacking classifier then utilizes these distance scores as input and learns to identify duplicate posts. We assess the proposed framework’s effectiveness in identifying duplicate posts in an online Question and Answer platform. By combining several text representation methods, INCEPT performs well in the duplicate posts classification task. Our experiments demonstrate that specific framework configurations outperform the accuracy scores obtained from individual text representation methods. Therefore, we also infer that no single text representation method can independently capture a text’s features.
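
The pipeline described in the abstract (several text representations, one pairwise distance per representation, and a classifier trained on those distances) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the representations used here (word counts, character trigrams, token length), the distance functions, the toy question pairs, and the plain logistic-regression meta-classifier are all assumptions chosen to keep the example dependency-free. The abstract does not name the representation methods or the classifier that INCEPT actually uses.

```python
import math
from collections import Counter


def word_counts(text):
    """Bag-of-words representation (hypothetical choice)."""
    return Counter(text.lower().split())


def char_trigrams(text):
    """Character-trigram representation (hypothetical choice)."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    if norm == 0:
        return 1.0
    return max(0.0, 1.0 - dot / norm)  # clamp tiny negative rounding errors


def jaccard_distance(a, b):
    """1 - |intersection| / |union| over the vocabularies of a and b."""
    union = set(a) | set(b)
    return 1.0 - len(set(a) & set(b)) / len(union) if union else 0.0


def length_distance(t1, t2):
    """Relative difference in token count."""
    n1, n2 = len(t1.split()), len(t2.split())
    return abs(n1 - n2) / max(n1, n2, 1)


def distance_features(t1, t2):
    """One distance score per (representation, metric) combination."""
    w1, w2 = word_counts(t1), word_counts(t2)
    c1, c2 = char_trigrams(t1), char_trigrams(t2)
    return [
        cosine_distance(w1, w2),
        jaccard_distance(w1, w2),
        cosine_distance(c1, c2),
        length_distance(t1, t2),
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=2000, lr=0.5):
    """Meta-classifier: logistic regression fitted with plain SGD."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict_duplicate(model, t1, t2):
    """True if the meta-classifier scores the pair as a duplicate."""
    w, b = model
    x = distance_features(t1, t2)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5


# Toy training pairs (invented for illustration): 1 = duplicate, 0 = not.
pairs = [
    ("how do I reverse a list in python", "how to reverse a list in python"),
    ("what is a null pointer exception", "what is a null pointer exception in java"),
    ("how do I reverse a list in python", "best way to center a div with css"),
    ("what is a null pointer exception", "how to install numpy on windows"),
]
labels = [1, 1, 0, 0]

X = [distance_features(a, b) for a, b in pairs]
model = train_logistic(X, labels)
```

Because the classifier only ever sees distance scores, adding another representation in this sketch means appending one more entry to the list returned by `distance_features`; the rest of the code is unchanged.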
Source journal
ACS Applied Bio Materials (Chemistry: Chemistry (all))
CiteScore: 9.40
Self-citation rate: 2.10%
Articles published: 464
Journal description: ACS Applied Bio Materials is an interdisciplinary journal publishing original research covering all aspects of biomaterials and biointerfaces, including and beyond the traditional biosensing, biomedical, and therapeutic applications. The journal is devoted to reports of new and original experimental and theoretical research of an applied nature that integrates knowledge in the areas of materials, engineering, physics, bioscience, and chemistry into important bio applications. The journal is specifically interested in work that addresses the relationship between structure and function and assesses the stability and degradation of materials under relevant environmental and biological conditions.