
Proceedings of the ACM Symposium on Document Engineering 2023: Latest Publications

Character Relationship Mapping in Major Fictional Works Using Text Analysis Methods
Pub Date: 2023-08-22 DOI: 10.1145/3573128.3609345
Sam Wolyn, S. Simske
Determining the relationships between characters is an important step in analyzing fictional works. Knowing character relationships is useful when summarizing a work and may also help determine authorship. In this paper, scores are generated for pairs of characters in fictional works, which can be used to classify whether or not two characters have a relationship. An SVM is used to predict relationships between characters. Characters farther from the decision boundary often had stronger relationships than those closer to it. The relative rank of the relationships may serve additional literary and authorship-related purposes.
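The SVM setup described here lends itself to a compact illustration. Below is a minimal sketch assuming scikit-learn and hypothetical pairwise features (a co-occurrence count and a proximity score); the actual features and data used by Wolyn and Simske are not reproduced.

```python
# Minimal sketch: SVM over hypothetical character-pair features; distance from
# the decision boundary is read as relationship strength.
import numpy as np
from sklearn.svm import SVC

# Feature vectors for character pairs and binary labels: 1 = related, 0 = not.
X = np.array([[12, 0.8], [1, 0.1], [9, 0.7], [0, 0.05], [15, 0.9], [2, 0.2]])
y = np.array([1, 0, 1, 0, 1, 0])

svm = SVC(kernel="linear").fit(X, y)

# Signed distance from the decision boundary: larger positive margins are read
# as stronger relationships, matching the abstract's observation.
for pair, margin in zip(X, svm.decision_function(X)):
    print(pair, "related" if margin > 0 else "unrelated", f"(margin {margin:+.2f})")
```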
Citations: 1
Static Pruning for Multi-Representation Dense Retrieval
Pub Date: 2023-08-22 DOI: 10.1145/3573128.3604896
A. Acquavia, C. Macdonald, N. Tonellotto
Dense retrieval approaches are challenging the prevalence of inverted index-based sparse representation approaches in information retrieval systems. Different families have arisen: single representations for each query or passage (such as ANCE or DPR), or multiple representations (usually one per token), as exemplified by the ColBERT model. While ColBERT is effective, it requires significant storage space for each token's embedding. In this work, we aim to prune the embeddings of tokens that are not important for effectiveness. We show that, by adapting standard uniform and document-centric static pruning methods to embedding-based indexes, while retaining their focus on low-IDF tokens, we can attain large improvements in space efficiency while maintaining high effectiveness. In experiments conducted on the MSMARCO passage ranking task, removing all embeddings corresponding to the 100 most frequent BERT tokens reduces the index size by 45%, with limited impact on effectiveness (e.g. no statistically significant degradation of nDCG@10 or MAP on the TREC 2020 query set). Similarly, on TREC Covid, we observed a 1.3% reduction in nDCG@10 for a 38% reduction in total index size.
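The pruning step the abstract describes, dropping the embeddings of the most frequent (low-IDF) tokens, can be sketched compactly. The snippet below is a toy illustration assuming per-passage token-id lists with one embedding row per token; ColBERT's real index layout and the collection statistics used to pick the frequent-token set differ.

```python
# Toy sketch of uniform static pruning for a multi-representation index.
import numpy as np

def prune_passage(token_ids, embeddings, frequent_ids):
    """Drop embeddings of tokens in the frequent (low-IDF) set."""
    keep = [i for i, t in enumerate(token_ids) if t not in frequent_ids]
    return [token_ids[i] for i in keep], embeddings[keep]

# Hypothetical: the 100 most frequent BERT token ids would come from collection
# statistics; a toy set stands in here.
frequent_ids = {1996, 1037, 1997}                 # e.g. "the", "a", ...
token_ids = [1996, 4248, 1037, 8954]
embeddings = np.random.rand(4, 128)               # one 128-d vector per token
kept_ids, kept_emb = prune_passage(token_ids, embeddings, frequent_ids)
print(kept_ids, kept_emb.shape)                   # -> [4248, 8954] (2, 128)
```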
Citations: 1
Automatically Inferring the Document Class of a Scientific Article
Pub Date: 2023-08-22 DOI: 10.1145/3573128.3604894
Antoine Gauquier, P. Senellart
We consider the problem of automatically inferring the (LaTeX) document class used to write a scientific article from its PDF representation. Applications include improving the performance of information extraction techniques that rely on the style used in each document class, and determining the publisher of a given scientific article. We introduce two approaches: a simple classifier based on hand-coded document style features, and a CNN-based classifier that takes as input the bitmap representation of the first page of the PDF article. We experiment on a dataset of around 100k articles from arXiv, where labels come from the source LaTeX document associated with each article. Results show the CNN approach significantly outperforms the one based on simple document style features, reaching over 90% average F1-score on a task distinguishing among several dozen of the most common document classes.
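A minimal sketch of the bitmap pipeline follows, assuming pdf2image (poppler-based) for page rendering and a torchvision ResNet as a stand-in backbone; the paper's actual CNN architecture, preprocessing, and class inventory are not specified here, and the class names below are purely illustrative.

```python
# Sketch: render page 1 of a PDF to a bitmap and classify its document class.
import torch
from torchvision import models, transforms
from pdf2image import convert_from_path

preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # pages are near-monochrome
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Hypothetical document classes; the paper distinguishes several dozen.
CLASS_NAMES = ["acmart", "IEEEtran", "llncs", "elsarticle"]

model = models.resnet18(num_classes=len(CLASS_NAMES))
model.eval()

def classify_first_page(pdf_path):
    """Render the first page to a bitmap and predict its document class."""
    page = convert_from_path(pdf_path, first_page=1, last_page=1)[0]
    x = preprocess(page).unsqueeze(0)             # (1, 3, 224, 224)
    with torch.no_grad():
        logits = model(x)
    return CLASS_NAMES[logits.argmax(dim=1).item()]

# Hypothetical usage (requires a real PDF on disk):
# print(classify_first_page("paper.pdf"))
```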
Citations: 1
AI-powered Resume-Job matching: A document ranking approach using deep neural networks
Pub Date: 2023-08-22 DOI: 10.1145/3573128.3609347
Sima Rezaeipourfarsangi, E. Milios
This study focuses on the importance of well-designed online matching systems for job seekers and employers. We treat resumes and job descriptions as documents, calculate their similarity to determine the suitability of applicants, and rank a set of resumes based on their similarity to a specific job description. We employ Siamese Neural Networks, comprised of identical sub-network components, to evaluate the semantic similarity between documents. Our architecture integrates several neural network components: each sub-network incorporates CNN, LSTM, and attention layers to capture local, sequential, and global patterns within the data. The LSTM and CNN branches are applied concurrently and merged, and the resulting output is fed into a multi-head attention layer. These layers extract features and capture document representations, which are then combined to form a unified representation of the document. We leverage pre-trained language models to obtain embeddings for each document, which serve as a lower-dimensional representation of our input data. The model is trained on a private dataset of 268,549 real resumes and 4,198 job descriptions from twelve industry sectors, producing a ranked list of matched resumes. We performed a comparative analysis involving our model, Siamese CNNs (S-CNNs), Siamese LSTM with Manhattan distance, and a BERT-based sentence transformer model. By combining the power of language models with the novel Siamese architecture, this approach improves document ranking accuracy and enhances the matching between job descriptions and resumes. Our experimental results demonstrate that our model outperforms the compared models.
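A skeleton of the Siamese arrangement described above, in PyTorch: identical sub-networks with concurrent CNN and LSTM branches merged and fed to multi-head attention, pooled to a document vector and scored by cosine similarity. This is a simplified sketch under assumed dimensions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SubNet(nn.Module):
    """One shared sub-network: concurrent CNN + LSTM branches, then attention."""
    def __init__(self, emb_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)

    def forward(self, x):                                  # x: (batch, seq, emb)
        c = self.cnn(x.transpose(1, 2)).transpose(1, 2)    # local patterns
        l, _ = self.lstm(x)                                # sequential patterns
        merged = torch.cat([c, l], dim=-1)                 # concurrent branches merged
        a, _ = self.attn(merged, merged, merged)           # global patterns
        return a.mean(dim=1)                               # pooled document vector

class SiameseMatcher(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = SubNet()            # identical weights for both inputs

    def forward(self, resume, job):
        r, j = self.encoder(resume), self.encoder(job)
        return nn.functional.cosine_similarity(r, j)       # ranking score

# Hypothetical usage: pre-trained embeddings stand in for real documents.
resume = torch.randn(2, 50, 128)
job = torch.randn(2, 80, 128)
print(SiameseMatcher()(resume, job))       # one score per resume-job pair
```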
Citations: 0
Quality, Space and Time Competition on Binarizing Photographed Document Images
Pub Date: 2023-08-22 DOI: 10.1145/3573128.3604903
R. Lins, Gabriel de F. Pe Silva, Gustavo P. Chaves, Ricardo da Silva Barboza, R. Bernardino, S. Simske
Document image binarization is a fundamental step in many document processes. No binarization algorithm performs well on all types of document images, as differences among digitization devices, together with the physical noise present in the document or introduced during digitization, alter their performance. Besides that, processing time is also an important factor that may restrict applicability. This competition on binarizing photographed documents assessed the quality, time, space, and performance of five new algorithms and sixty-four "classical" and alternative algorithms. The evaluation dataset is composed of laser- and deskjet-printed documents, photographed using six widely used mobile devices with the strobe flash on and off, under two different capture angles and placements.
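As context for the task being assessed, here is a baseline binarization sketch using OpenCV's Otsu and adaptive thresholding; these are standard stand-ins to illustrate the task, not among the algorithms evaluated in the competition.

```python
# Baseline sketches: global Otsu thresholding vs. a locally adaptive threshold.
import cv2

def binarize_otsu(gray):
    """Global threshold chosen by Otsu's method; struggles with uneven flash."""
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

def binarize_adaptive(gray):
    """Local Gaussian-weighted threshold; more robust to uneven illumination."""
    return cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 31, 10)

# Hypothetical usage (requires a photographed page on disk):
# gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
# cv2.imwrite("binary.png", binarize_adaptive(gray))
```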
Citations: 0
Exploiting Label Dependencies for Multi-Label Document Classification Using Transformers
Pub Date: 2023-08-22 DOI: 10.1145/3573128.3609356
Haytame Fallah, Emmanuel Bruno, P. Bellot, Elisabeth Murisasco
In this paper, we introduce a new approach to improve deep learning-based architectures for multi-label document classification. Dependencies between labels are an essential factor in the multi-label context. Our proposed strategy takes advantage of the knowledge extracted from label co-occurrences. The proposed method adds a regularization term to the loss function used for training the model, incorporating the label similarities given by label co-occurrences to encourage the model to jointly predict labels that are likely to co-occur and to avoid predicting labels that rarely appear together. This allows the neural model to better capture label dependencies. Our approach was evaluated on three datasets: the standard AAPD dataset, a corpus of scientific abstracts; Reuters-21578, a collection of news articles; and a newly proposed multi-label dataset called arXiv-ACM. Our method demonstrates improved performance, setting a new state of the art on all three datasets.
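The regularized objective can be sketched as follows in PyTorch, assuming a label-similarity matrix built from training-set co-occurrence counts; the exact form of the paper's regularization term may differ.

```python
# Sketch: BCE loss plus a co-occurrence-aware regularizer.
import torch
import torch.nn.functional as F

def cooccurrence_similarity(Y):
    """Y: (num_docs, num_labels) binary matrix -> (L, L) normalized co-occurrence."""
    C = Y.T @ Y                                      # raw pairwise counts
    d = torch.clamp(C.diagonal(), min=1.0)
    return C / torch.sqrt(d.unsqueeze(0) * d.unsqueeze(1))

def regularized_loss(logits, targets, S, lam=0.1):
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    p = torch.sigmoid(logits)
    joint = p.unsqueeze(2) * p.unsqueeze(1)          # (batch, L, L) joint scores
    # Penalize jointly predicting pairs that rarely co-occur (low similarity).
    reg = ((1.0 - S) * joint).mean()
    return bce + lam * reg

# Hypothetical usage: 5 labels, co-occurrence statistics from 100 documents.
Y_train = torch.randint(0, 2, (100, 5)).float()
S = cooccurrence_similarity(Y_train)
logits, targets = torch.randn(4, 5), torch.randint(0, 2, (4, 5)).float()
print(regularized_loss(logits, targets, S))
```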
Citations: 0
A PDF Malware Detection Method Using Extremely Small Training Sample Size
Pub Date: 2023-08-22 DOI: 10.1145/3573128.3609352
Ran Liu, Cynthia Matuszek, Charles Nicholas
Machine learning-based methods for PDF malware detection have grown in popularity because of their high accuracy. However, many well-known ML-based detectors require a large number of specimen features to be collected before making a decision, which can be time-consuming. In this study, we present a novel, distance-based method for detecting PDF malware. Notably, our approach needs significantly less training data than traditional machine learning or neural network models. We evaluated our method on the Contagio dataset and report that it can detect 90.50% of malware samples with only 20 benign PDF files used for model training. To show statistical significance, we report results with a 95% confidence interval (CI). We evaluated our model's performance across multiple metrics, including accuracy, F1 score, precision, and recall, alongside false positive, false negative, true positive, and true negative rates. This paper highlights the feasibility of using distance-based methods for PDF malware detection, even with limited training data, thereby offering a promising direction for future research.
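A minimal sketch of a distance-based detector in the spirit of the abstract follows, trained only on a handful of benign files. The numeric feature vectors and the scale-normalized distance to a benign centroid are illustrative assumptions; the paper's feature set and distance measure are not reproduced.

```python
# Sketch: flag files that lie far from a benign-only profile.
import numpy as np

def fit_benign_profile(benign_features):
    """Learn a centroid and spread from as few as ~20 benign PDFs."""
    mu = benign_features.mean(axis=0)
    sigma = benign_features.std(axis=0) + 1e-8
    return mu, sigma

def predict(x, mu, sigma, threshold=3.0):
    """Flag a sample as malware if it lies far from the benign centroid."""
    dist = np.linalg.norm((x - mu) / sigma)    # scale-normalized distance
    return "malware" if dist > threshold else "benign"

# Hypothetical usage: 20 benign training samples, one suspicious file.
benign = np.random.rand(20, 8)                 # 8 hypothetical PDF features
mu, sigma = fit_benign_profile(benign)
print(predict(np.random.rand(8) * 5, mu, sigma))
```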
Citations: 0
LYLAA: A Lightweight YOLO based Legend and Axis Analysis method for CHART-Infographics
Pub Date: 2023-08-22 DOI: 10.1145/3573128.3609355
Hadia Showkat Kawoosa, Muhammad Suhaib Kanroo, P. Goyal
Chart Data Extraction (CDE) is a complex task in document analysis that involves extracting data from charts to facilitate accessibility for various applications, such as document mining, medical diagnosis, and accessibility for the visually impaired. CDE is challenging due to the intricate structure and specific semantics of charts, which include elements such as the title, axes, legend, and plot elements. Existing solutions for CDE have not yet satisfactorily addressed these issues. In this paper, we focus on two critical subtasks in CDE, Legend Analysis and Axis Analysis, and present a lightweight YOLO-based method for detection together with domain-specific heuristic algorithms (Axis Matching and Legend Matching) for matching. We evaluate the efficacy of our proposed method, LYLAA, on a real-world dataset, the ICPR2022 UB PMC dataset, and observe promising results compared to the competing teams in the ICPR2022 CHART-Infographics competition. Our findings showcase the potential of our proposed method in the CDE process.
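The matching stage can be illustrated with a simple nearest-neighbour assignment between detected axis ticks and detected text labels. This is a hedged sketch of the general idea only, not LYLAA's actual Axis Matching heuristic, and the pixel coordinates below are made up.

```python
# Sketch: assign each detected axis tick to its nearest detected text label.
import numpy as np

def match_ticks_to_labels(tick_points, label_centers):
    """Greedy nearest-neighbour matching on (x, y) box centers."""
    assignments = {}
    for i, (tx, ty) in enumerate(tick_points):
        d = [np.hypot(tx - lx, ty - ly) for lx, ly in label_centers]
        assignments[i] = int(np.argmin(d))
    return assignments

# Hypothetical detections from the YOLO stage (pixel coordinates).
ticks = [(50, 400), (150, 400), (250, 400)]
labels = [(48, 420), (152, 421), (251, 419)]
print(match_ticks_to_labels(ticks, labels))   # -> {0: 0, 1: 1, 2: 2}
```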
Citations: 0
Tabular Corner Detection in Historical Irish Records
Pub Date: 2023-08-22 DOI: 10.1145/3573128.3609349
Enda O'Shea
The process of extracting relevant data from historical handwritten documents can be time-consuming and challenging. In Ireland, from 1864 to 1922, government records of births, deaths, and marriages were documented by local registrars using printed tabular structures. Leveraging this systematic layout, we employ a neural network capable of segmenting scanned versions of these record documents. We isolate the corner points with the goal of extracting the vital tabular elements and transforming them into consistently structured standalone images. By achieving uniformity in the segmented images, we enable more accurate row and column segmentation, enhancing our ability to isolate and classify individual cell contents effectively. This process must accommodate varying image qualities, different tabular orientations and sizes resulting from diverse scanning procedures, as well as the faded and damaged ink lines that naturally occur over time.
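Once the four corners are predicted, warping the table to a fixed-size upright image is a standard perspective transform. A minimal OpenCV sketch follows; the corner coordinates, corner ordering, and output size are illustrative assumptions, and the segmentation network itself is not shown.

```python
# Sketch: rectify a detected tabular region into a standalone upright image.
import cv2
import numpy as np

def rectify_table(image, corners, out_w=1200, out_h=800):
    """Warp the detected tabular region to a consistently sized upright image.

    corners: four (x, y) points in top-left, top-right, bottom-right,
    bottom-left order, as predicted by the segmentation model.
    """
    src = np.array(corners, dtype=np.float32)
    dst = np.array([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]],
                   dtype=np.float32)
    M = cv2.getPerspectiveTransform(src, dst)   # 3x3 homography from 4 pairs
    return cv2.warpPerspective(image, M, (out_w, out_h))

# Hypothetical usage (requires a scanned register page on disk):
# page = cv2.imread("register_page.jpg")
# table = rectify_table(page, [(102, 88), (1510, 95), (1498, 1040), (95, 1032)])
```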
Citations: 0
Deep-learning for dysgraphia detection in children handwritings
Pub Date: 2023-08-22 DOI: 10.1145/3573128.3609351
Andrea Gemelli, S. Marinai, Emanuele Vivoli, T. Zappaterra
Early identification of dysgraphia in children is crucial for timely intervention and support. Traditional methods, such as the Brave Handwriting Kinder (BHK) test, which relies on manual scoring of handwritten sentences, are time-consuming and subjective, posing challenges to accurate and efficient diagnosis. In this paper, we propose an approach for dysgraphia detection that leverages smart pens and deep learning techniques, automatically extracting visual features from children's handwriting samples. To validate the solution, samples of children's handwriting were gathered and several interviews with domain experts were conducted. The approach has been compared with an algorithmic version of the BHK test and with interviews of several elementary school teachers.
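A toy PyTorch sketch of a CNN feature extractor over rendered handwriting samples is given below. The authors' actual architecture, smart-pen signal handling, and labeling protocol are not described in the abstract and are not reproduced here.

```python
# Sketch: small CNN classifying grayscale handwriting crops into two classes.
import torch
import torch.nn as nn

class HandwritingCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),           # pool to one 32-d descriptor
        )
        self.head = nn.Linear(32, 2)           # dysgraphia vs. typical writing

    def forward(self, x):                      # x: (batch, 1, H, W) grayscale
        f = self.features(x).flatten(1)
        return self.head(f)

model = HandwritingCNN()
print(model(torch.randn(4, 1, 128, 256)).shape)   # -> torch.Size([4, 2])
```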
儿童书写障碍的早期识别对于及时干预和支持至关重要。传统的方法,如Brave Handwriting Kinder (BHK)测试,依赖于手写句子的人工评分,既耗时又主观,对准确高效的诊断提出了挑战。本文提出了一种利用智能笔和深度学习技术自动提取儿童笔迹样本视觉特征的书写障碍检测方法。为了验证该解决方案,收集了儿童手迹样本,并与领域专家进行了多次访谈。该方法已与BHK测试的算法版本和几位小学教师的访谈进行了比较。
{"title":"Deep-learning for dysgraphia detection in children handwritings","authors":"Andrea Gemelli, S. Marinai, Emanuele Vivoli, T. Zappaterra","doi":"10.1145/3573128.3609351","DOIUrl":"https://doi.org/10.1145/3573128.3609351","url":null,"abstract":"Early identification of dysgraphia in children is crucial for timely intervention and support. Traditional methods, such as the Brave Handwriting Kinder (BHK) test, which relies on manual scoring of handwritten sentences, are both time-consuming and subjective posing challenges in accurate and efficient diagnosis. In this paper, an approach for dysgraphia detection by leveraging smart pens and deep learning techniques is proposed, automatically extracting visual features from children's handwriting samples. To validate the solution, samples of children handwritings have been gathered and several interviews with domain experts have been conducted. The approach has been compared with an algorithmic version of the BHK test and with several elementary school teachers' interviews.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115137924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0