Addressing the gap between current language models and key-term-based clustering
Eric M. Cabral, Sima Rezaeipourfarsangi, Maria Cristina Ferreira de Oliveira, E. Milios, R. Minghim
This paper presents MOD-kt, a modular framework designed to bridge the gap between modern language models and key-term-based document clustering. One of the main challenges of using neural language models for key-term-based clustering is the mismatch between the limited interpretability of the underlying document representation (i.e., document embeddings) and the more intuitive semantic elements that let the user guide the clustering process (i.e., key-terms). Our framework acts as a communication layer between word and document models, enabling key-term-based clustering on neural document representations through a flexible and adaptable architecture. We compare the clustering performance of multiple neural language models over a selected range of relevance metrics. Additionally, a qualitative user study illustrates the framework's potential for intuitive, user-guided, high-quality clustering of document collections.
{"title":"Addressing the gap between current language models and key-term-based clustering","authors":"Eric M. Cabral, Sima Rezaeipourfarsangi, Maria Cristina Ferreira de Oliveira, E. Milios, R. Minghim","doi":"10.1145/3573128.3604900","DOIUrl":"https://doi.org/10.1145/3573128.3604900","url":null,"abstract":"This paper presents MOD-kt, a modular framework designed to bridge the gap between modern language models and key-term-based document clustering. One of the main challenges of using neural language models for key-term-based clustering is the mismatch between the interpretability of the underlying document representation (i.e. document embeddings) and the more intuitive semantic elements that allow the user to guide the clustering process (i.e. key-terms). Our framework acts as a communication layer between word and document models, enabling key-term-based clustering in the context of document and word models with a flexible and adaptable architecture. We report a comparison of the performance of multiple neural language models on clustering, considering a selected range of relevance metrics. Additionally, a qualitative user study was conducted to illustrate the framework's potential for intuitive user-guided quality clustering of document collections.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115139492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WEATHERGOV+: A Table Recognition and Summarization Dataset to Bridge the Gap Between Document Image Analysis and Natural Language Generation
Amanda Dash, Melissa Cote, A. Albu
Tables, ubiquitous in data-oriented documents such as scientific papers and financial statements, organize and convey relational information. Automatic table recognition from document images, which involves detection within the page, structural segmentation into rows, columns, and cells, and information extraction from cells, has been a popular research topic in document image analysis (DIA). With recent advances in natural language generation (NLG) based on deep neural networks, data-to-text generation, in particular table summarization, offers interesting solutions to time-intensive data analysis. In this paper, we aim to bridge the gap between efforts in DIA and NLG regarding tabular data: we propose WEATHERGOV+, a dataset building upon WEATHERGOV, the standard dataset for tabular data summarization, that allows for the training and testing of end-to-end methods working from input document images to output text summaries. WEATHERGOV+ contains images of tables created from the tabular data of WEATHERGOV using visual variations that cover various levels of difficulty, along with the corresponding human-generated table summaries from WEATHERGOV. We also propose an end-to-end pipeline that compares state-of-the-art table recognition methods for summarization purposes. We analyse the proposed pipeline by evaluating WEATHERGOV+ at each stage to identify the effects of error propagation and the weaknesses of current methods, such as OCR errors. With this research (dataset and code are publicly available), we hope to encourage new research on the processing and management of inter- and intra-document collections.
{"title":"WEATHERGOV+: A Table Recognition and Summarization Dataset to Bridge the Gap Between Document Image Analysis and Natural Language Generation","authors":"Amanda Dash, Melissa Cote, A. Albu","doi":"10.1145/3573128.3604901","DOIUrl":"https://doi.org/10.1145/3573128.3604901","url":null,"abstract":"Tables, ubiquitous in data-oriented documents like scientific papers and financial statements, organize and convey relational information. Automatic table recognition from document images, which involves detection within the page, structural segmentation into rows, columns, and cells, and information extraction from cells, has been a popular research topic in document image analysis (DIA). With recent advances in natural language generation (NLG) based on deep neural networks, data-to-text generation, in particular for table summarization, offers interesting solutions to time-intensive data analysis. In this paper, we aim to bridge the gap between efforts in DIA and NLG regarding tabular data: we propose WEATHERGOV+, a dataset building upon the WEATHERGOV dataset, the standard for tabular data summarization techniques, that allows for the training and testing of end-to-end methods working from input document images to generate text summaries as output. WEATHERGOV+ contains images of tables created from the tabular data of WEATHERGOV using visual variations that cover various levels of difficulty, along with the corresponding human-generated table summaries of WEATHERGOV. We also propose an end-to-end pipeline that compares state-of-the-art table recognition methods for summarization purposes. We analyse the results of the proposed pipeline by evaluating WEATHERGOV+ at each stage of the pipeline to identify the effects of error propagation and the weaknesses of the current methods, such as OCR errors. With this research (dataset and code available here1), we hope to encourage new research for the processing and management of inter- and intra-document collections.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"35 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126776820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Genetic Generative Information Retrieval
Hrishikesh Kulkarni, Zachary Young, Nazli Goharian, O. Frieder, Sean MacAvaney
Documents come in all shapes and sizes and are created by many different means, including, nowadays, generative language models. We demonstrate that a simple genetic algorithm can improve generative information retrieval by using a document's text as its genetic representation, a relevance model as the fitness function, and a large language model as a genetic operator that introduces diversity through random changes to the text, producing new documents. By "mutating" highly relevant documents and "crossing over" content between documents, we produce new documents of greater relevance to a user's information need, as validated by estimated relevance scores from various models and by a preliminary human evaluation. We also identify challenges that demand further study.
{"title":"Genetic Generative Information Retrieval","authors":"Hrishikesh Kulkarni, Zachary Young, Nazli Goharian, O. Frieder, Sean MacAvaney","doi":"10.1145/3573128.3609340","DOIUrl":"https://doi.org/10.1145/3573128.3609340","url":null,"abstract":"Documents come in all shapes and sizes and are created by many different means, including now-a-days, generative language models. We demonstrate that a simple genetic algorithm can improve generative information retrieval by using a document's text as a genetic representation, a relevance model as a fitness function, and a large language model as a genetic operator that introduces diversity through random changes to the text to produce new documents. By \"mutating\" highly-relevant documents and \"crossing over\" content between documents, we produce new documents of greater relevance to a user's information need --- validated in terms of estimated relevance scores from various models and via a preliminary human evaluation. We also identify challenges that demand further study.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134322924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using YOLO Network for Automatic Processing of Finite Automata Images with Application to Bit-Strings Recognition
Daniela S. Costa, C. Mello
The recognition of handwritten diagrams has drawn attention in recent years because of its potential applications in many areas, especially for educational purposes. Although there are many online approaches, advances in deep object detection networks have made offline recognition an attractive option, allowing simple inputs such as paper-drawn diagrams. In this paper, we tested the YOLO network, including its version with fewer parameters, YOLO-Tiny, for the recognition of images of finite automata. This recognition was applied to the development of an application that recognizes bit-strings given as input to the automaton: given an image of a transition diagram, the user inserts a sequence of bits and the system analyzes whether the automaton recognizes the sequence or not. Using two finite automata datasets, we evaluated the detection and recognition of finite automata symbols as well as bit-string processing. For the diagram symbol detection task, experiments on a handwritten finite automata image dataset returned 82.04% and 97.20% for average precision and recall, respectively.
{"title":"Using YOLO Network for Automatic Processing of Finite Automata Images with Application to Bit-Strings Recognition","authors":"Daniela S. Costa, C. Mello","doi":"10.1145/3573128.3604898","DOIUrl":"https://doi.org/10.1145/3573128.3604898","url":null,"abstract":"The recognition of handwritten diagrams has drawn attention in recent years because of their potential applications in many areas, especially when it can be used for educational purposes. Although there are many online approaches, the advances of deep object detector networks have made offline recognition an attractive option, allowing simple inputs such as paper-drawn diagrams. In this paper, we have tested the YOLO network, including its version with fewer parameters, YOLO-Tiny, for the recognition of images of finite automata. This recognition was applied to the development of an application that recognizes bit-strings used as input to the automaton: given an image of a transition diagram, the user inserts a sequence of bits and the system analyzes whether the automaton recognizes the sequence or not. Using two bases of finite automata, we have evaluated the detection and recognition of finite automata symbols as well as bit-string processing. With regard to the diagram symbol detection task, experiments on a handwritten finite automata image dataset returned 82.04% and 97.20% for average precision and recall, respectively.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129828750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Layout Analysis of Historic Architectural Program Documents
A. Oliaee, Andrew R. Tripp
In this paper, we introduce and make publicly available the CRS Visual Dataset, a new dataset consisting of 7,029 pages of human-annotated and validated scanned archival documents from the field of 20th-century architectural programming; and ArcLayNet, a fine-tuned machine learning model based on the YOLOv6-S object detection architecture. Architectural programming is an essential professional service in the Architecture, Engineering, Construction, and Operations (AECO) industry, and the documents it produces are powerful instruments of this service. The documents in this dataset are the product of a creative process; they exhibit a variety of sizes, orientations, arrangements, and modes of content, and are underrepresented in current datasets. This paper describes the dataset and narrates an iterative process of quality control in which several deficiencies were identified and addressed to improve the performance of the model. In this process, our key performance indicators, mAP@0.5 and mAP@0.5:0.95, both improved by approximately 10%.
{"title":"Layout Analysis of Historic Architectural Program Documents","authors":"A. Oliaee, Andrew R. Tripp","doi":"10.1145/3573128.3609339","DOIUrl":"https://doi.org/10.1145/3573128.3609339","url":null,"abstract":"In this paper, we introduce and make publicly available the CRS Visual Dataset, a new dataset consisting of 7,029 pages of human-annotated and validated scanned archival documents from the field of 20th-century architectural programming; and ArcLayNet, a fine-tuned machine learning model based on the YOLOv6-S object detection architecture. Architectural programming is an essential professional service in the Architecture, Engineering, Construction, and Operations (AECO) Industry, and the documents it produces are powerful instruments of this service. The documents in this dataset are the product of a creative process; they exhibit a variety of sizes, orientations, arrangements, and modes of content, and are underrepresented in current datasets. This paper describes the dataset and narrates an iterative process of quality control in which several deficiencies were identified and addressed to improve the performance of the model. In this process, our key performance indicators, mAP@0.5 and mAP@0.5:0.95, both improved by approximately 10%.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127135789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A document format for sewing patterns
Charlotte Curtis
Sewing patterns are a form of technical document, requiring expertise to author and understand. Digital patterns are typically produced and sold as PDFs with human-interpretable vector graphics, but there is little consistency or machine-readable metadata in these documents. A custom file format would enable digital pattern manipulation tools to enhance or replace a paper-based workflow. In this vision paper, basic sewing pattern components and modification processes are introduced, and the limitations of current PDF patterns are highlighted. Next, an XML-based sewing pattern document format is proposed to take advantage of the inherent relationships between different pattern components. Finally, document security and authenticity considerations are discussed.
{"title":"A document format for sewing patterns","authors":"Charlotte Curtis","doi":"10.1145/3573128.3609353","DOIUrl":"https://doi.org/10.1145/3573128.3609353","url":null,"abstract":"Sewing patterns are a form of technical document, requiring expertise to author and understand. Digital patterns are typically produced and sold as PDFs with human-interpretable vector graphics, but there is little consistency or machine-readable metadata in these documents. A custom file format would enable digital pattern manipulation tools to enhance or replace a paper based workflow. In this vision paper, basic sewing pattern components and modification processes are introduced, and the limitations of current PDF patterns are highlighted. Next, an XML-based sewing pattern document format is proposed to take advantage of the inherent relationships between different pattern components. Finally, document security and authenticity considerations are discussed.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129997035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Algorithm Parallelism for Improved Extractive Summarization
Arturo N. Villanueva, S. Simske
While much work on abstractive summarization has been conducted in recent years, including state-of-the-art summaries from GPT-4, extractive summarization's lossless nature continues to provide advantages, preserving the style and often the key phrases of the original text as intended by the author. Libraries for extractive summarization abound, with a wide range of efficacy; some perform little better, or even worse, than random sampling of sentences from the original text. This study breathes new life into classical algorithms by proposing parallelism through an implementation of a second-order meta-algorithm in the form of the Tessellation and Recombination with Expert Decisioner (T&R) pattern, taking advantage of the abundance of existing algorithms and dissociating their individual performance from the implementer's biases. The resulting summaries obtained using T&R are better than those of any of the component algorithms.
{"title":"Algorithm Parallelism for Improved Extractive Summarization","authors":"Arturo N. Villanueva, S. Simske","doi":"10.1145/3573128.3609350","DOIUrl":"https://doi.org/10.1145/3573128.3609350","url":null,"abstract":"While much work on abstractive summarization has been conducted in recent years, including state-of-the-art summarizations from GPT-4, extractive summarization's lossless nature continues to provide advantages, preserving the style and often key phrases of the original text as meant by the author. Libraries for extractive summarization abound, with a wide range of efficacy. Some do not perform much better or perform even worse than random sampling of sentences extracted from the original text. This study breathes new life to using classical algorithms by proposing parallelism through an implementation of a second order meta-algorithm in the form of the Tessellation and Recombination with Expert Decisioner (T&R) pattern, taking advantage of the abundance of already-existing algorithms and dissociating their individual performance from the implementer's biases. Resulting summaries obtained using T&R are better than any of the component algorithms.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"15 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130757602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
YinYang, a Fast and Robust Adaptive Document Image Binarization for Optical Character Recognition
Jean-Luc Bloechle, J. Hennebert, Christophe Gisler
Optical Character Recognition (OCR) from document photos taken by cell phones is a challenging task. Most OCR methods require prior binarization of the image, which can be difficult to achieve when documents are captured with various mobile devices in unknown lighting conditions. For example, shadows cast by the camera or the camera holder on a hard copy can jeopardize the binarization process and hinder the next OCR step. In the case of highly uneven illumination, binarization methods using global thresholding simply fail, and state-of-the-art adaptive algorithms often deliver unsatisfactory results. In this paper, we present a new binarization algorithm using two complementary local adaptive passes and taking advantage of the color components to improve results over current image binarization methods. The proposed approach gave remarkable results at the DocEng'22 competition on the binarization of photographed documents.
{"title":"YinYang, a Fast and Robust Adaptive Document Image Binarization for Optical Character Recognition","authors":"Jean-Luc Bloechle, J. Hennebert, Christophe Gisler","doi":"10.1145/3573128.3609354","DOIUrl":"https://doi.org/10.1145/3573128.3609354","url":null,"abstract":"Optical Character Recognition (OCR) from document photos taken by cell phones is a challenging task. Most OCR methods require prior binarization of the image, which can be difficult to achieve when documents are captured with various mobile devices in unknown lighting conditions. For example, shadows cast by the camera or the camera holder on a hard copy can jeopardize the binarization process and hinder the next OCR step. In the case of highly uneven illumination, binarization methods using global thresholding simply fail, and state-of-the-art adaptive algorithms often deliver unsatisfactory results. In this paper, we present a new binarization algorithm using two complementary local adaptive passes and taking advantage of the color components to improve results over current image binarization methods. The proposed approach gave remarkable results at the DocEng'22 competition on the binarization of photographed documents.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116143366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatically Labeling Cyber Threat Intelligence reports using Natural Language Processing
Hamza Abdi, S. Bagley, S. Furnell, J. Twycross
Attribution provides valuable intelligence in the face of Advanced Persistent Threat (APT) attacks. By accurately identifying the culprits and actors behind the attacks, we can gain insights into their motivations, capabilities, and potential future targets. Cyber Threat Intelligence (CTI) reports are relied upon to attribute these attacks effectively. These reports are compiled by security experts and provide valuable information about threat actors and their attacks. We are interested in building a fully automated APT attribution framework, and an essential step towards it is the automated processing and extraction of information from CTI reports. However, CTI reports are largely unstructured, making extraction and analysis of the information a difficult task. As a first step, we introduce a method for automatically highlighting a CTI report with the main threat actor attributed within the report, using a custom Natural Language Processing (NLP) model based on the spaCy library. We also evaluate the performance and effectiveness of the various PDF-to-text Python libraries used in this work. To assess our model, we experimented on a dataset of 605 English documents, randomly collected from various sources on the internet and manually labeled; our method achieved an accuracy of 97%. Finally, we discuss the challenges associated with processing these documents automatically and propose methods for tackling them.
{"title":"Automatically Labeling Cyber Threat Intelligence reports using Natural Language Processing","authors":"Hamza Abdi, S. Bagley, S. Furnell, J. Twycross","doi":"10.1145/3573128.3609348","DOIUrl":"https://doi.org/10.1145/3573128.3609348","url":null,"abstract":"Attribution provides valuable intelligence in the face of Advanced Persistent Threat (APT) attacks. By accurately identifying the culprits and actors behind the attacks, we can gain more insights into their motivations, capabilities, and potential future targets. Cyber Threat Intelligence (CTI) reports are relied upon to attribute these attacks effectively. These reports are compiled by security experts and provide valuable information about threat actors and their attacks. We are interested in building a fully automated APT attribution framework. An essential step in doing so is the automated processing and extraction of information from CTI reports. However, CTI reports are largely unstructured, making extraction and analysis of the information a difficult task. To begin this work, we introduce a method for automatically highlighting a CTI report with the main threat actor attributed within the report. This is done using a custom Natural Language Processing (NLP) model based on the spaCy library. Also, the study showcases and highlights the performance and effectiveness of various pdf-to-text Python libraries that were used in this work. Additionally, to evaluate the effectiveness of our model, we experimented on a dataset consisting of 605 English documents, which were randomly collected from various sources on the internet and manually labeled. Our method achieved an accuracy of 97%. Finally, we discuss the challenges associated with processing these documents automatically and propose some methods for tackling them.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116422876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Privacy Lost and Found: An Investigation at Scale of Web Privacy Policy Availability
Mukund Srinath, S. Sundareswara, Pranav Narayanan Venkit, C. Giles, Shomir Wilson
Legal jurisdictions around the world require organisations to post privacy policies on their websites. However, in spite of laws such as the GDPR and CCPA reinforcing this requirement, organisations sometimes do not comply, and a variety of semi-compliant failure modes exist. To investigate the landscape of web privacy policies, we crawl the privacy policies of 7 million organisation websites with the goal of identifying when policies are unavailable. We conduct a large-scale investigation of the availability of privacy policies and identify potential reasons for unavailability, such as dead links, documents with empty content, documents that consist solely of placeholder text, and documents unavailable in the specific languages offered by their respective websites. We estimate the frequencies of these failure modes and the overall unavailability of privacy policies on the web, and find that privacy policy URLs are available on only 34% of websites. Of these URLs, 1.37% are broken links and 1.23% of the valid links lead to pages without a policy. Further, to enable investigation of privacy policies at scale, we use the capture-recapture technique to estimate the total number of English-language privacy policies on the web and the distribution of these documents across top-level domains and sectors of commerce. We estimate the lower bound on the number of English-language privacy policies to be around 3 million. Finally, we release the CoLIPPs Corpus containing around 600k policies and their metadata, consisting of policy URL, length, readability, sector of commerce, and policy crawl date.
{"title":"Privacy Lost and Found: An Investigation at Scale of Web Privacy Policy Availability","authors":"Mukund Srinath, S. Sundareswara, Pranav Narayanan Venkit, C. Giles, Shomir Wilson","doi":"10.1145/3573128.3604902","DOIUrl":"https://doi.org/10.1145/3573128.3604902","url":null,"abstract":"Legal jurisdictions around the world require organisations to post privacy policies on their websites. However, in spite of laws such as GDPR and CCPA reinforcing this requirement, organisations sometimes do not comply, and a variety of semi-compliant failure modes exist. To investigate the landscape of web privacy policies, we crawl the privacy policies from 7 million organisation websites with the goal of identifying when policies are unavailable. We conduct a large-scale investigation of the availability of privacy policies and identify potential reasons for unavailability such as dead links, documents with empty content, documents that consist solely of placeholder text, and documents unavailable in the specific languages offered by their respective websites. We estimate the frequencies of these failure modes and the overall unavailability of privacy policies on the web and find that privacy policies URLs are only available in 34% of websites. Further, 1.37% of these URLs are broken links and 1.23% of the valid links lead to pages without a policy. Further, to enable investigation of privacy policies at scale, we use the capture-recapture technique to estimate the total number of English language privacy policies on the web and the distribution of these documents across top level domains and sectors of commerce. We estimate the lower bound on the number of English language privacy policies to be around 3 million. Finally, we release the CoLIPPs Corpus containing around 600k policies and their metadata consisting of policy URL, length, readability, sector of commerce, and policy crawl date.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126859725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}