Addressing the gap between current language models and key-term-based clustering
Eric M. Cabral, Sima Rezaeipourfarsangi, Maria Cristina Ferreira de Oliveira, E. Milios, R. Minghim
This paper presents MOD-kt, a modular framework designed to bridge the gap between modern language models and key-term-based document clustering. One of the main challenges of using neural language models for key-term-based clustering is the mismatch between the limited interpretability of the underlying document representation (i.e., document embeddings) and the more intuitive semantic elements that let the user guide the clustering process (i.e., key-terms). Our framework acts as a communication layer between word and document models, enabling key-term-based clustering on neural document representations through a flexible and adaptable architecture. We compare the clustering performance of multiple neural language models over a selected range of relevance metrics. Additionally, a qualitative user study illustrates the framework's potential for intuitive, user-guided, high-quality clustering of document collections.
{"title":"Addressing the gap between current language models and key-term-based clustering","authors":"Eric M. Cabral, Sima Rezaeipourfarsangi, Maria Cristina Ferreira de Oliveira, E. Milios, R. Minghim","doi":"10.1145/3573128.3604900","DOIUrl":"https://doi.org/10.1145/3573128.3604900","url":null,"abstract":"This paper presents MOD-kt, a modular framework designed to bridge the gap between modern language models and key-term-based document clustering. One of the main challenges of using neural language models for key-term-based clustering is the mismatch between the interpretability of the underlying document representation (i.e. document embeddings) and the more intuitive semantic elements that allow the user to guide the clustering process (i.e. key-terms). Our framework acts as a communication layer between word and document models, enabling key-term-based clustering in the context of document and word models with a flexible and adaptable architecture. We report a comparison of the performance of multiple neural language models on clustering, considering a selected range of relevance metrics. Additionally, a qualitative user study was conducted to illustrate the framework's potential for intuitive user-guided quality clustering of document collections.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115139492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WEATHERGOV+: A Table Recognition and Summarization Dataset to Bridge the Gap Between Document Image Analysis and Natural Language Generation
Amanda Dash, Melissa Cote, A. Albu
Tables, ubiquitous in data-oriented documents such as scientific papers and financial statements, organize and convey relational information. Automatic table recognition from document images, which involves detection within the page, structural segmentation into rows, columns, and cells, and information extraction from cells, has been a popular research topic in document image analysis (DIA). With recent advances in natural language generation (NLG) based on deep neural networks, data-to-text generation, in particular table summarization, offers interesting solutions to time-intensive data analysis. In this paper, we aim to bridge the gap between efforts in DIA and NLG regarding tabular data: we propose WEATHERGOV+, a dataset building upon WEATHERGOV, the standard dataset for tabular data summarization, that allows for the training and testing of end-to-end methods working from input document images to output text summaries. WEATHERGOV+ contains images of tables created from the tabular data of WEATHERGOV using visual variations that cover various levels of difficulty, along with the corresponding human-generated table summaries from WEATHERGOV. We also propose an end-to-end pipeline that compares state-of-the-art table recognition methods for summarization purposes. We analyse the proposed pipeline by evaluating WEATHERGOV+ at each stage to identify the effects of error propagation and the weaknesses of current methods, such as OCR errors. With this research (dataset and code are publicly available), we hope to encourage new research on the processing and management of inter- and intra-document collections.
{"title":"WEATHERGOV+: A Table Recognition and Summarization Dataset to Bridge the Gap Between Document Image Analysis and Natural Language Generation","authors":"Amanda Dash, Melissa Cote, A. Albu","doi":"10.1145/3573128.3604901","DOIUrl":"https://doi.org/10.1145/3573128.3604901","url":null,"abstract":"Tables, ubiquitous in data-oriented documents like scientific papers and financial statements, organize and convey relational information. Automatic table recognition from document images, which involves detection within the page, structural segmentation into rows, columns, and cells, and information extraction from cells, has been a popular research topic in document image analysis (DIA). With recent advances in natural language generation (NLG) based on deep neural networks, data-to-text generation, in particular for table summarization, offers interesting solutions to time-intensive data analysis. In this paper, we aim to bridge the gap between efforts in DIA and NLG regarding tabular data: we propose WEATHERGOV+, a dataset building upon the WEATHERGOV dataset, the standard for tabular data summarization techniques, that allows for the training and testing of end-to-end methods working from input document images to generate text summaries as output. WEATHERGOV+ contains images of tables created from the tabular data of WEATHERGOV using visual variations that cover various levels of difficulty, along with the corresponding human-generated table summaries of WEATHERGOV. We also propose an end-to-end pipeline that compares state-of-the-art table recognition methods for summarization purposes. We analyse the results of the proposed pipeline by evaluating WEATHERGOV+ at each stage of the pipeline to identify the effects of error propagation and the weaknesses of the current methods, such as OCR errors. With this research (dataset and code available here1), we hope to encourage new research for the processing and management of inter- and intra-document collections.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"35 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126776820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Genetic Generative Information Retrieval
Hrishikesh Kulkarni, Zachary Young, Nazli Goharian, O. Frieder, Sean MacAvaney
Documents come in all shapes and sizes and are created by many different means, including, nowadays, generative language models. We demonstrate that a simple genetic algorithm can improve generative information retrieval by using a document's text as its genetic representation, a relevance model as the fitness function, and a large language model as a genetic operator that introduces diversity through random changes to the text, producing new documents. By "mutating" highly relevant documents and "crossing over" content between documents, we produce new documents of greater relevance to a user's information need, as validated by estimated relevance scores from various models and by a preliminary human evaluation. We also identify challenges that demand further study.
{"title":"Genetic Generative Information Retrieval","authors":"Hrishikesh Kulkarni, Zachary Young, Nazli Goharian, O. Frieder, Sean MacAvaney","doi":"10.1145/3573128.3609340","DOIUrl":"https://doi.org/10.1145/3573128.3609340","url":null,"abstract":"Documents come in all shapes and sizes and are created by many different means, including now-a-days, generative language models. We demonstrate that a simple genetic algorithm can improve generative information retrieval by using a document's text as a genetic representation, a relevance model as a fitness function, and a large language model as a genetic operator that introduces diversity through random changes to the text to produce new documents. By \"mutating\" highly-relevant documents and \"crossing over\" content between documents, we produce new documents of greater relevance to a user's information need --- validated in terms of estimated relevance scores from various models and via a preliminary human evaluation. We also identify challenges that demand further study.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134322924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using YOLO Network for Automatic Processing of Finite Automata Images with Application to Bit-Strings Recognition
Daniela S. Costa, C. Mello
The recognition of handwritten diagrams has drawn attention in recent years because of its potential applications in many areas, especially for educational purposes. Although there are many online approaches, advances in deep object detection networks have made offline recognition an attractive option, allowing simple inputs such as paper-drawn diagrams. In this paper, we tested the YOLO network, including its version with fewer parameters, YOLO-Tiny, for the recognition of images of finite automata. This recognition was applied to the development of an application that recognizes bit-strings given as input to the automaton: given an image of a transition diagram, the user inserts a sequence of bits and the system analyzes whether the automaton recognizes the sequence or not. Using two finite automata datasets, we evaluated the detection and recognition of finite automata symbols as well as bit-string processing. For the diagram symbol detection task, experiments on a handwritten finite automata image dataset returned 82.04% and 97.20% for average precision and recall, respectively.
{"title":"Using YOLO Network for Automatic Processing of Finite Automata Images with Application to Bit-Strings Recognition","authors":"Daniela S. Costa, C. Mello","doi":"10.1145/3573128.3604898","DOIUrl":"https://doi.org/10.1145/3573128.3604898","url":null,"abstract":"The recognition of handwritten diagrams has drawn attention in recent years because of their potential applications in many areas, especially when it can be used for educational purposes. Although there are many online approaches, the advances of deep object detector networks have made offline recognition an attractive option, allowing simple inputs such as paper-drawn diagrams. In this paper, we have tested the YOLO network, including its version with fewer parameters, YOLO-Tiny, for the recognition of images of finite automata. This recognition was applied to the development of an application that recognizes bit-strings used as input to the automaton: given an image of a transition diagram, the user inserts a sequence of bits and the system analyzes whether the automaton recognizes the sequence or not. Using two bases of finite automata, we have evaluated the detection and recognition of finite automata symbols as well as bit-string processing. With regard to the diagram symbol detection task, experiments on a handwritten finite automata image dataset returned 82.04% and 97.20% for average precision and recall, respectively.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129828750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Layout Analysis of Historic Architectural Program Documents
A. Oliaee, Andrew R. Tripp
In this paper, we introduce and make publicly available the CRS Visual Dataset, a new dataset consisting of 7,029 pages of human-annotated and validated scanned archival documents from the field of 20th-century architectural programming; and ArcLayNet, a fine-tuned machine learning model based on the YOLOv6-S object detection architecture. Architectural programming is an essential professional service in the Architecture, Engineering, Construction, and Operations (AECO) industry, and the documents it produces are powerful instruments of this service. The documents in this dataset are the product of a creative process; they exhibit a variety of sizes, orientations, arrangements, and modes of content, and are underrepresented in current datasets. This paper describes the dataset and narrates an iterative process of quality control in which several deficiencies were identified and addressed to improve the performance of the model. In this process, our key performance indicators, mAP@0.5 and mAP@0.5:0.95, both improved by approximately 10%.
{"title":"Layout Analysis of Historic Architectural Program Documents","authors":"A. Oliaee, Andrew R. Tripp","doi":"10.1145/3573128.3609339","DOIUrl":"https://doi.org/10.1145/3573128.3609339","url":null,"abstract":"In this paper, we introduce and make publicly available the CRS Visual Dataset, a new dataset consisting of 7,029 pages of human-annotated and validated scanned archival documents from the field of 20th-century architectural programming; and ArcLayNet, a fine-tuned machine learning model based on the YOLOv6-S object detection architecture. Architectural programming is an essential professional service in the Architecture, Engineering, Construction, and Operations (AECO) Industry, and the documents it produces are powerful instruments of this service. The documents in this dataset are the product of a creative process; they exhibit a variety of sizes, orientations, arrangements, and modes of content, and are underrepresented in current datasets. This paper describes the dataset and narrates an iterative process of quality control in which several deficiencies were identified and addressed to improve the performance of the model. In this process, our key performance indicators, mAP@0.5 and mAP@0.5:0.95, both improved by approximately 10%.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127135789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A document format for sewing patterns
Charlotte Curtis
Sewing patterns are a form of technical document, requiring expertise to author and understand. Digital patterns are typically produced and sold as PDFs with human-interpretable vector graphics, but there is little consistency or machine-readable metadata in these documents. A custom file format would enable digital pattern manipulation tools to enhance or replace a paper-based workflow. In this vision paper, basic sewing pattern components and modification processes are introduced, and the limitations of current PDF patterns are highlighted. Next, an XML-based sewing pattern document format is proposed to take advantage of the inherent relationships between different pattern components. Finally, document security and authenticity considerations are discussed.
{"title":"A document format for sewing patterns","authors":"Charlotte Curtis","doi":"10.1145/3573128.3609353","DOIUrl":"https://doi.org/10.1145/3573128.3609353","url":null,"abstract":"Sewing patterns are a form of technical document, requiring expertise to author and understand. Digital patterns are typically produced and sold as PDFs with human-interpretable vector graphics, but there is little consistency or machine-readable metadata in these documents. A custom file format would enable digital pattern manipulation tools to enhance or replace a paper based workflow. In this vision paper, basic sewing pattern components and modification processes are introduced, and the limitations of current PDF patterns are highlighted. Next, an XML-based sewing pattern document format is proposed to take advantage of the inherent relationships between different pattern components. Finally, document security and authenticity considerations are discussed.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129997035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Algorithm Parallelism for Improved Extractive Summarization
Arturo N. Villanueva, S. Simske
While much work on abstractive summarization has been conducted in recent years, including state-of-the-art summaries from GPT-4, extractive summarization's lossless nature continues to provide advantages, preserving the style and often the key phrases of the original text as intended by the author. Libraries for extractive summarization abound, with a wide range of efficacy; some perform little better, or even worse, than random sampling of sentences from the original text. This study breathes new life into classical algorithms by proposing parallelism through an implementation of a second-order meta-algorithm in the form of the Tessellation and Recombination with Expert Decisioner (T&R) pattern, taking advantage of the abundance of existing algorithms and dissociating their individual performance from the implementer's biases. The resulting summaries obtained using T&R are better than those of any of the component algorithms.
{"title":"Algorithm Parallelism for Improved Extractive Summarization","authors":"Arturo N. Villanueva, S. Simske","doi":"10.1145/3573128.3609350","DOIUrl":"https://doi.org/10.1145/3573128.3609350","url":null,"abstract":"While much work on abstractive summarization has been conducted in recent years, including state-of-the-art summarizations from GPT-4, extractive summarization's lossless nature continues to provide advantages, preserving the style and often key phrases of the original text as meant by the author. Libraries for extractive summarization abound, with a wide range of efficacy. Some do not perform much better or perform even worse than random sampling of sentences extracted from the original text. This study breathes new life to using classical algorithms by proposing parallelism through an implementation of a second order meta-algorithm in the form of the Tessellation and Recombination with Expert Decisioner (T&R) pattern, taking advantage of the abundance of already-existing algorithms and dissociating their individual performance from the implementer's biases. Resulting summaries obtained using T&R are better than any of the component algorithms.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"15 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130757602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
YinYang, a Fast and Robust Adaptive Document Image Binarization for Optical Character Recognition
Jean-Luc Bloechle, J. Hennebert, Christophe Gisler
Optical Character Recognition (OCR) from document photos taken by cell phones is a challenging task. Most OCR methods require prior binarization of the image, which can be difficult to achieve when documents are captured with various mobile devices in unknown lighting conditions. For example, shadows cast by the camera or the camera holder on a hard copy can jeopardize the binarization process and hinder the next OCR step. In the case of highly uneven illumination, binarization methods using global thresholding simply fail, and state-of-the-art adaptive algorithms often deliver unsatisfactory results. In this paper, we present a new binarization algorithm using two complementary local adaptive passes and taking advantage of the color components to improve results over current image binarization methods. The proposed approach gave remarkable results at the DocEng'22 competition on the binarization of photographed documents.
{"title":"YinYang, a Fast and Robust Adaptive Document Image Binarization for Optical Character Recognition","authors":"Jean-Luc Bloechle, J. Hennebert, Christophe Gisler","doi":"10.1145/3573128.3609354","DOIUrl":"https://doi.org/10.1145/3573128.3609354","url":null,"abstract":"Optical Character Recognition (OCR) from document photos taken by cell phones is a challenging task. Most OCR methods require prior binarization of the image, which can be difficult to achieve when documents are captured with various mobile devices in unknown lighting conditions. For example, shadows cast by the camera or the camera holder on a hard copy can jeopardize the binarization process and hinder the next OCR step. In the case of highly uneven illumination, binarization methods using global thresholding simply fail, and state-of-the-art adaptive algorithms often deliver unsatisfactory results. In this paper, we present a new binarization algorithm using two complementary local adaptive passes and taking advantage of the color components to improve results over current image binarization methods. The proposed approach gave remarkable results at the DocEng'22 competition on the binarization of photographed documents.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116143366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatically Labeling Cyber Threat Intelligence reports using Natural Language Processing
Hamza Abdi, S. Bagley, S. Furnell, J. Twycross
Attribution provides valuable intelligence in the face of Advanced Persistent Threat (APT) attacks. By accurately identifying the culprits and actors behind the attacks, we can gain insights into their motivations, capabilities, and potential future targets. Cyber Threat Intelligence (CTI) reports are relied upon to attribute these attacks effectively. These reports are compiled by security experts and provide valuable information about threat actors and their attacks. We are interested in building a fully automated APT attribution framework, and an essential step towards it is the automated processing and extraction of information from CTI reports. However, CTI reports are largely unstructured, making extraction and analysis of the information a difficult task. As a first step, we introduce a method for automatically highlighting a CTI report with the main threat actor attributed within the report, using a custom Natural Language Processing (NLP) model based on the spaCy library. We also evaluate the performance and effectiveness of the various PDF-to-text Python libraries used in this work. To assess our model, we experimented on a dataset of 605 English documents, randomly collected from various sources on the internet and manually labeled; our method achieved an accuracy of 97%. Finally, we discuss the challenges associated with processing these documents automatically and propose methods for tackling them.
{"title":"Automatically Labeling Cyber Threat Intelligence reports using Natural Language Processing","authors":"Hamza Abdi, S. Bagley, S. Furnell, J. Twycross","doi":"10.1145/3573128.3609348","DOIUrl":"https://doi.org/10.1145/3573128.3609348","url":null,"abstract":"Attribution provides valuable intelligence in the face of Advanced Persistent Threat (APT) attacks. By accurately identifying the culprits and actors behind the attacks, we can gain more insights into their motivations, capabilities, and potential future targets. Cyber Threat Intelligence (CTI) reports are relied upon to attribute these attacks effectively. These reports are compiled by security experts and provide valuable information about threat actors and their attacks. We are interested in building a fully automated APT attribution framework. An essential step in doing so is the automated processing and extraction of information from CTI reports. However, CTI reports are largely unstructured, making extraction and analysis of the information a difficult task. To begin this work, we introduce a method for automatically highlighting a CTI report with the main threat actor attributed within the report. This is done using a custom Natural Language Processing (NLP) model based on the spaCy library. Also, the study showcases and highlights the performance and effectiveness of various pdf-to-text Python libraries that were used in this work. Additionally, to evaluate the effectiveness of our model, we experimented on a dataset consisting of 605 English documents, which were randomly collected from various sources on the internet and manually labeled. Our method achieved an accuracy of 97%. Finally, we discuss the challenges associated with processing these documents automatically and propose some methods for tackling them.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116422876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Privacy Lost and Found: An Investigation at Scale of Web Privacy Policy Availability
Mukund Srinath, S. Sundareswara, Pranav Narayanan Venkit, C. Giles, Shomir Wilson
Legal jurisdictions around the world require organisations to post privacy policies on their websites. However, in spite of laws such as the GDPR and CCPA reinforcing this requirement, organisations sometimes do not comply, and a variety of semi-compliant failure modes exist. To investigate the landscape of web privacy policies, we crawl the privacy policies of 7 million organisation websites with the goal of identifying when policies are unavailable. We conduct a large-scale investigation of the availability of privacy policies and identify potential reasons for unavailability, such as dead links, documents with empty content, documents that consist solely of placeholder text, and documents unavailable in the specific languages offered by their respective websites. We estimate the frequencies of these failure modes and the overall unavailability of privacy policies on the web, and find that privacy policy URLs are available on only 34% of websites. Of these URLs, 1.37% are broken links and 1.23% of the valid links lead to pages without a policy. Further, to enable investigation of privacy policies at scale, we use the capture-recapture technique to estimate the total number of English-language privacy policies on the web and the distribution of these documents across top-level domains and sectors of commerce. We estimate the lower bound on the number of English-language privacy policies to be around 3 million. Finally, we release the CoLIPPs Corpus containing around 600k policies and their metadata, consisting of policy URL, length, readability, sector of commerce, and policy crawl date.
{"title":"Privacy Lost and Found: An Investigation at Scale of Web Privacy Policy Availability","authors":"Mukund Srinath, S. Sundareswara, Pranav Narayanan Venkit, C. Giles, Shomir Wilson","doi":"10.1145/3573128.3604902","DOIUrl":"https://doi.org/10.1145/3573128.3604902","url":null,"abstract":"Legal jurisdictions around the world require organisations to post privacy policies on their websites. However, in spite of laws such as GDPR and CCPA reinforcing this requirement, organisations sometimes do not comply, and a variety of semi-compliant failure modes exist. To investigate the landscape of web privacy policies, we crawl the privacy policies from 7 million organisation websites with the goal of identifying when policies are unavailable. We conduct a large-scale investigation of the availability of privacy policies and identify potential reasons for unavailability such as dead links, documents with empty content, documents that consist solely of placeholder text, and documents unavailable in the specific languages offered by their respective websites. We estimate the frequencies of these failure modes and the overall unavailability of privacy policies on the web and find that privacy policies URLs are only available in 34% of websites. Further, 1.37% of these URLs are broken links and 1.23% of the valid links lead to pages without a policy. Further, to enable investigation of privacy policies at scale, we use the capture-recapture technique to estimate the total number of English language privacy policies on the web and the distribution of these documents across top level domains and sectors of commerce. We estimate the lower bound on the number of English language privacy policies to be around 3 million. Finally, we release the CoLIPPs Corpus containing around 600k policies and their metadata consisting of policy URL, length, readability, sector of commerce, and policy crawl date.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126859725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}