User-generated short-text classification using cograph editing-based network clustering with an application in invoice categorization

IF 2.7 | Zone 3 (Computer Science) | Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Data & Knowledge Engineering | Pub Date: 2023-11-01 | DOI: 10.1016/j.datak.2023.102238

Dewan F. Wahid, Elkafi Hassini

Data & Knowledge Engineering, Volume 148, Article 102238. Full text: https://www.sciencedirect.com/science/article/pii/S0169023X23000988

Citation count: 0

Abstract

The rapid adoption of online business platforms in every sector creates an enormous amount of user-generated textual data related to product or service descriptions, reviews, marketing, invoicing, and bookkeeping. These data are often short, noisy (e.g., misspellings, abbreviations), and lack accurate classification labels (line-item categories). Classifying these user-generated short texts into appropriate line-item categories is crucial for the platforms to understand their users' needs. This paper proposes a framework for user-generated short-text classification based on identified line-item categories. In the line-item identification phase, we use cograph editing (CoE)-based clustering on a keyword network constructed from the user-generated short texts. We also propose integer linear programming (ILP) formulations for CoE on weighted networks and design a heuristic algorithm to identify clusters in large-scale networks. Finally, we outline an application of this framework to invoice categorization in an empirical setting. The framework showed promising results in identifying invoice line-item categories for large-scale data.
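To make the line-item identification idea concrete, here is a minimal sketch, not the authors' ILP formulation or heuristic, neither of which is given in the abstract. It relies on the standard characterization of a cograph as a graph with no induced path on four vertices (P4). It builds a keyword co-occurrence network from a few toy short texts, lists the induced P4s, and brute-forces the cheapest set of edge insertions and deletions that removes them. The function names, the co-occurrence weighting, the toy texts, and the unit `insert_cost` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations, permutations


def keyword_network(texts, stopwords=frozenset()):
    """Weighted co-occurrence graph (assumed construction):
    edge weight = number of texts that contain both keywords."""
    weights = Counter()
    for text in texts:
        words = sorted({w for w in text.lower().split() if w not in stopwords})
        for u, v in combinations(words, 2):
            weights[(u, v)] += 1
    return dict(weights)


def induced_p4s(nodes, edges):
    """Return every induced path a-b-c-d on four vertices, once per path."""
    def adj(u, v):
        return (u, v) in edges or (v, u) in edges

    found = []
    for quad in combinations(sorted(nodes), 4):
        for a, b, c, d in permutations(quad):
            if a > d:  # skip the reversed copy of the same path
                continue
            if (adj(a, b) and adj(b, c) and adj(c, d)
                    and not adj(a, c) and not adj(a, d) and not adj(b, d)):
                found.append((a, b, c, d))
    return found


def is_cograph(nodes, edges):
    """A graph is a cograph exactly when it has no induced P4."""
    return not induced_p4s(nodes, edges)


def brute_force_cograph_edit(nodes, weights, insert_cost=1):
    """Cheapest set of pair flips (delete an edge or insert a non-edge) that
    yields a cograph. Exponential search: only usable on tiny graphs.
    Deleting an edge costs its weight; inserting costs insert_cost (assumed)."""
    pairs = list(combinations(sorted(nodes), 2))
    best_cost, best_flips = float("inf"), None
    for k in range(len(pairs) + 1):
        for flips in combinations(pairs, k):
            cost = sum(weights.get(p, insert_cost) for p in flips)
            if cost >= best_cost:
                continue
            edges = set(weights) ^ set(flips)  # flip membership of chosen pairs
            if is_cograph(nodes, edges):
                best_cost, best_flips = cost, flips
    return best_cost, best_flips


# Toy invoice-like descriptions (hypothetical data).
texts = ["web design", "web hosting", "logo design"]
weights = keyword_network(texts)
nodes = {w for pair in weights for w in pair}

print(induced_p4s(nodes, set(weights)))
print(brute_force_cograph_edit(nodes, weights))
```

In this toy network the keywords form the single path hosting, web, design, logo, so one edit is enough to reach a cograph. On networks of realistic size the exhaustive search is infeasible, which is why an ILP formulation or a heuristic of the kind the abstract mentions is needed.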




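Source journal

Data & Knowledge Engineering (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore: 5.00

Self-citation rate: 0.00%

Articles published:

Review time: 6 months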

Journal description: Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between these two related fields of interest. DKE reaches a worldwide audience of researchers, designers, managers, and users. The major aim of the journal is to identify, investigate, and analyze the underlying principles in the design and effective use of these systems.
