Erjon Skenderi, Jukka Huhtamäki, Salla-Maaria Laaksonen, Kostas Stefanidis
ACM Transactions on the Web. Published 2024-07-15. DOI: 10.1145/3677322 (https://doi.org/10.1145/3677322). Impact factor: 2.6. JCR: Q2 (Computer Science, Information Systems).
INCEPT: A Framework for Duplicate Posts Classification with Combined Text Representations
Dealing with many of the problems related to the quality of textual content online involves identifying similar content. Algorithmic solutions for duplicate content classification typically rely on text vector representation, which maps textual information into a set of features. Ideally, this representation would capture all aspects of the underlying text, including length, word frequencies, syntax, and semantics. While recent advancements in text representation have led to improved performance, a comprehensive approach that explicitly incorporates all text features has not yet been proposed. In this study, we present the INCEPT framework that utilizes multiple representation methods to detect duplicate text pairs, taking advantage of their individual strengths. The core of our approach involves using a stacking ensemble of pairwise vector distance measurements that are computed from multiple text representation methods. A stacking classifier then utilizes these distance scores as input and learns to identify duplicate posts. We assess the proposed framework’s effectiveness in identifying duplicate posts in an online Question and Answer platform. By combining several text representation methods, INCEPT performs well in the duplicate posts classification task. Our experiments demonstrate that specific framework configurations outperform the accuracy scores obtained from individual text representation methods. Therefore, we also infer that no single text representation method can independently capture a text’s features.
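The idea in the abstract, one distance score per text representation fed to a classifier that learns to flag duplicates, can be illustrated with a minimal sketch. This is not the authors' implementation. The two representations (word counts and character trigrams), the cosine distance, the length-gap feature, the toy question pairs, and the hand-written logistic regression standing in for the stacking classifier are all assumptions made for illustration.

```python
import math
from collections import Counter

def word_counts(text):
    """Bag-of-words representation: token -> frequency."""
    return Counter(text.lower().split())

def char_trigrams(text):
    """Character 3-gram representation: trigram -> frequency."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0

def pair_features(x, y):
    """One distance score per representation, plus a relative length gap."""
    length_gap = abs(len(x) - len(y)) / max(len(x), len(y), 1)
    return [
        cosine_distance(word_counts(x), word_counts(y)),
        cosine_distance(char_trigrams(x), char_trigrams(y)),
        length_gap,
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on log-loss; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def predict(model, x, y):
    """Estimated probability that posts x and y are duplicates."""
    w, b = model
    return sigmoid(sum(wj * fj for wj, fj in zip(w, pair_features(x, y))) + b)

# Hypothetical toy data: (post A, post B, 1 if duplicate else 0).
pairs = [
    ("how do I reverse a list in python", "how to reverse a list in python", 1),
    ("what is a python decorator", "what is a decorator in python", 1),
    ("how to merge two dictionaries", "how do I merge two dictionaries in python", 1),
    ("how do I reverse a list in python", "what is the capital of france", 0),
    ("what is a python decorator", "how to bake sourdough bread", 0),
    ("how to merge two dictionaries", "why is the sky blue", 0),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
model = train_logistic(X, y)
```

Calling `predict(model, "reverse a python list", "how can I reverse a list")` then returns a score between 0 and 1 for an unseen pair. In this sketch, adding another representation only means appending one more distance to `pair_features`; the classifier decides how much weight each score deserves.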
About the journal:
ACM Transactions on the Web (TWEB) is a journal publishing refereed articles reporting the results of research on Web content, applications, use, and related enabling technologies. Topics in the scope of TWEB include but are not limited to the following: Browsers and Web Interfaces; Electronic Commerce; Electronic Publishing; Hypertext and Hypermedia; Semantic Web; Web Engineering; Web Services and Service-Oriented Computing; and XML.
In addition, papers addressing the intersection of the following broader technologies with the Web are also in scope: Accessibility; Business Services; Education; Knowledge Management and Representation; Mobility and Pervasive Computing; Performance and Scalability; Recommender Systems; Searching, Indexing, Classification, Retrieval and Querying; Data Mining and Analysis; Security and Privacy; and User Interfaces.
Papers discussing specific Web technologies, applications, content generation and management, and use are within scope. Papers describing novel applications of the Web, as well as papers on the underlying technologies, are also welcome.