
ACM Transactions on the Web: Latest Publications

Deep Gated Multi-modal Fusion for Image Privacy Prediction
IF 3.5 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-07-22 · DOI: 10.1145/3608446
Chenye Zhao, Cornelia Caragea

With the rapid development of technologies in mobile devices, people can post their daily lives on social networking sites such as Facebook, Flickr, and Instagram. This raises new privacy concerns because people often do not realize that private information can be leaked and used to their detriment. Image privacy prediction models are developed to predict whether images contain sensitive information (private images) or are safe to be shared online (public images). Despite significant progress on this task, some crucial problems remain to be solved. First, images’ content and tags have been found to be useful modalities for automatically predicting images’ privacy. To date, most image privacy prediction models use single modalities (image-only or tag-only), which limits their performance. Second, we observe that current image privacy prediction models are surprisingly vulnerable to even small perturbations in the input data. Attackers can add small perturbations to input data and easily damage a well-trained image privacy prediction model. To address these challenges, in this paper, we propose a new decision-level Gated multi-modal fusion (GMMF) approach that fuses object, scene, and image tag modalities to predict privacy for online images. In particular, the proposed approach identifies, in a sample-by-sample manner, fusion weights for the class probability distributions generated by the single-modal classifiers according to the reliability of their privacy predictions for each target image, and performs a weighted decision-level fusion, so that modalities with high reliability are assigned higher fusion weights while those with low reliability are restrained with lower fusion weights. The results of our experiments show that the gated multi-modal fusion network effectively fuses single modalities and outperforms state-of-the-art models for image privacy prediction. Moreover, we perform adversarial training on our proposed GMMF model using multiple types of noise on the input data (i.e., images and/or tags). When noise attacks on the input data degrade some modalities, our approach effectively exploits the clean modalities and uses the fusion weights to minimize the negative influence of the degraded ones, achieving significantly stronger robustness than traditional fusion methods for image privacy prediction. The robustness of our GMMF model against data noise even generalizes to more severe noise levels. To the best of our knowledge, we are the first to investigate the robustness of image privacy prediction models against noise attacks. Moreover, as the performance of decision-level multi-modal fusion depends highly on the quality of the single-modal networks, we investigate self-distillation on the single-modal privacy classifiers and observe that transferring knowledge from a trained teacher model to a student model is beneficial in our proposed approach.
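
The fusion step described above, per-sample gate weights applied to the class probability distributions of the single-modal classifiers, can be illustrated with a minimal sketch. The gate below is a single randomly initialized linear layer over the concatenated distributions, and the two-class setup and three modalities are assumptions for illustration; this is not the trained GMMF architecture from the paper.

```python
# Minimal sketch of decision-level gated fusion (illustrative only; the
# gate is a random-initialised linear layer, not the trained GMMF gate).
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Class-probability outputs of three single-modal classifiers
# (object, scene, tags) for one image; 2 classes: [private, public].
p_object = np.array([0.80, 0.20])
p_scene  = np.array([0.55, 0.45])
p_tags   = np.array([0.30, 0.70])   # e.g. degraded by noisy tags
probs = np.stack([p_object, p_scene, p_tags])        # (3, 2)

# Gate: maps the concatenated single-modal distributions to one
# reliability weight per modality (placeholder random parameters).
W = rng.normal(size=(probs.size, probs.shape[0]))    # (6, 3)
b = np.zeros(probs.shape[0])
fusion_weights = softmax(probs.reshape(-1) @ W + b)  # (3,), sums to 1

# Weighted decision-level fusion: modalities the gate trusts more
# contribute more to the final class distribution.
fused = fusion_weights @ probs                        # (2,)
print("fusion weights:", fusion_weights)
print("fused P(private), P(public):", fused)
```

In a trained model, the gate parameters would be learned so that degraded modalities (e.g., noisy tags) receive small weights and clean modalities dominate the fused decision.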

{"title":"Deep Gated Multi-modal Fusion for Image Privacy Prediction","authors":"Chenye Zhao, Cornelia Caragea","doi":"https://dl.acm.org/doi/10.1145/3608446","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3608446","url":null,"abstract":"<p>With the rapid development of technologies in mobile devices, people can post their daily lives on social networking sites such as Facebook, Flickr, and Instagram. This leads to new privacy concerns due to people’s lack of understanding that private information can be leaked and used to their detriment. Image privacy prediction models are developed to predict whether images contain sensitive information (private images) or are safe to be shared online (public images). Despite significant progress on this task, there are still some crucial problems that remain to be solved. Firstly, images’ content and tags are found to be useful modalities to automatically predict images’ privacy. To date, most image privacy prediction models use single modalities (image-only or tag-only), which limits their performance. Secondly, we observe that current image privacy prediction models are surprisingly vulnerable to even small perturbations in the input data. Attackers can add small perturbations to input data and easily damage a well-trained image privacy prediction model. To address these challenges, in this paper, we propose a new decision-level Gated multi-modal fusion (GMMF) approach that fuses object, scene, and image tags modalities to predict privacy for online images. In particular, the proposed approach identifies fusion weights of class probability distributions generated by single-modal classifiers according to their reliability of the privacy prediction for each target image in a sample-by-sample manner and performs a weighted decision-level fusion, so that modalities with high reliability are assigned with higher fusion weights while ones with low reliability are restrained with lower fusion weights. The results of our experiments show that the gated multi-modal fusion network effectively fuses single modalities and outperforms state-of-the-art models for image privacy prediction. Moreover, we perform adversarial training on our proposed GMMF model using multiple types of noise on input data (i.e., images and/or tags). When some modalities are failed by input data with noise attacks, our approach effectively utilizes clean modalities and minimizes negative influences brought by degraded ones using fusion weights, achieving significantly stronger robustness over traditional fusion methods for image privacy prediction. The robustness of our GMMF model against data noise can even be generalized to more severe noise levels. To the best of our knowledge, we are the first to investigate the robustness of image privacy prediction models against noise attacks. 
Moreover, as the performance of decision-level multi-modal fusion depends highly on the quality of single-modal networks, we investigate self-distillation on single-modal privacy classifiers and observe that transferring knowledge from a trained teacher model to a student model is beneficial in our proposed approach.</p>","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":"42 36","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138495120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Dynamic Bayesian Contrastive Predictive Coding Model for Personalized Product Search
IF 3.5 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-07-13 · DOI: 10.1145/3609225
Bin Wu, Zaiqiao Meng, Shangsong Liang

In this paper, we study the problem of dynamic personalized product search. Due to the data-sparsity problem in the real world, existing methods suffer from data inefficiency. We address this challenge by proposing a Dynamic Bayesian Contrastive Predictive Coding model (DBCPC), which aims to capture the rich structured information behind search records to improve data efficiency. Our proposed DBCPC utilizes contrastive predictive learning to jointly learn dynamic embeddings together with the structural information of entities (i.e., users, products, and words). Specifically, our DBCPC employs structured prediction to tackle the intractability caused by the non-linear output space and utilizes a time-embedding technique to avoid designing a different encoder for each time step in the dynamic Bayesian model. In this way, our model jointly learns the underlying embeddings of entities (i.e., users, products, and words) via prediction tasks, which enables the embeddings to focus more on their general attributes and to capture general information as preferences evolve over time. For inferring the dynamic embeddings, we propose an inference algorithm that combines the variational and contrastive objectives. Experiments were conducted on an Amazon dataset, and the results show that our proposed DBCPC learns higher-quality embeddings and outperforms the state-of-the-art non-dynamic and dynamic models for product search.
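
Two ingredients mentioned above, a single encoder reused across time steps via additive time embeddings and a contrastive objective over dynamic embeddings, can be sketched roughly as follows. The sinusoidal time embedding, the dot-product score, and the InfoNCE-style loss are assumptions chosen for illustration; they are not DBCPC's exact formulation, which also involves structured prediction and a variational objective.

```python
# Illustrative sketch: time-conditioned embeddings scored with an
# InfoNCE-style contrastive loss. Dimensions, the sinusoidal time
# embedding, and the scoring function are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # embedding dimension

def time_embedding(t, d):
    """Sinusoidal embedding so one encoder can be reused across time steps."""
    i = np.arange(d // 2)
    angles = t / (10000 ** (2 * i / d))
    return np.concatenate([np.sin(angles), np.cos(angles)])

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

# Static base embeddings for one user and a pool of items.
user_base = rng.normal(size=d)
item_pool = rng.normal(size=(5, d))     # item 0 is the observed (positive) item

t = 3                                   # current time step
user_t = user_base + time_embedding(t, d)
items_t = item_pool + time_embedding(t, d)

# InfoNCE-style loss: the observed item should outscore sampled negatives.
scores = items_t @ user_t               # dot-product scores, shape (5,)
loss = -log_softmax(scores)[0]          # index 0 = positive item
print("contrastive loss at time", t, "=", float(loss))
```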

{"title":"Dynamic Bayesian Contrastive Predictive Coding Model for Personalized Product Search","authors":"Bin Wu, Zaiqiao Meng, Shangsong Liang","doi":"https://dl.acm.org/doi/10.1145/3609225","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3609225","url":null,"abstract":"<p>In this paper, we study the problem of dynamic personalized product search. Due to the data-sparsity problem in the real world, existing methods suffer from the challenge of data inefficiency. We address the challenge by proposing a Dynamic Bayesian Contrastive Predictive Coding model (DBCPC), which aims to capture the rich structured information behind search records to improve data efficiency. Our proposed DBCPC utilizes the contrastive predictive learning to jointly learn dynamic embeddings with structure information of entities (i.e., users, products and words). Specifically, our DBCPC employs the structured prediction to tackle the intractability caused by non-linear output space and utilizes the time embedding technique to avoid designing different encoders for each time in the Dynamic Bayesian models. In this way, our model jointly learns the underlying embeddings of entities (i.e., users, products and words) via prediction tasks, which enables the embeddings to focus more on their general attributes and capture the general information during the preference evolution with time. For inferring the dynamic embeddings, we propose an inference algorithm combining the variational objective and the contrastive objectives. Experiments were conducted on an Amazon dataset and the experimental results show that our proposed DBCPC can learn the higher-quality embeddings and outperforms the state-of-the-art non-dynamic and dynamic models for product search.</p>","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":"42 37","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138495119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Closeness Centrality on Uncertain Graphs
IF 3.5 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-07-11 · DOI: 10.1145/3604912
Zhenfang Liu, Jianxiong Ye, Zhaonian Zou

Centrality is a family of metrics for characterizing the importance of a vertex in a graph. Although a large number of centrality metrics have been proposed, a majority of them ignore uncertainty in graph data. In this article, we formulate closeness centrality on uncertain graphs and define the batch closeness centrality evaluation problem, which computes the closeness centrality of a subset of vertices in an uncertain graph. We develop three sampling-based algorithms, MS-BCC, MG-BCC, and MGMS-BCC, to approximate the closeness centrality of the specified vertices. All these algorithms need to perform breadth-first searches (BFS) starting from the specified vertices on a large number of sampled possible worlds of the uncertain graph. To improve the efficiency of the algorithms, we exploit operation-level parallelism of the BFS traversals and simultaneously execute the shared sequences of operations in the breadth-first searches. Parallelization is realized at different levels in these algorithms. The experimental results show that the proposed algorithms can efficiently and accurately approximate the closeness centrality of the given vertices. MGMS-BCC is faster than both MS-BCC and MG-BCC because it avoids more repeated executions of the shared operation sequences in the BFS traversals.
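
The possible-worlds sampling these algorithms build on can be sketched as follows: keep each edge with its existence probability, run a BFS from the target vertex in every sampled world, and average a per-world closeness value. The harmonic form of closeness (well defined on disconnected worlds) and the tiny example graph are assumptions for illustration; the batching and shared-operation parallelism of MS-BCC, MG-BCC, and MGMS-BCC are not shown.

```python
# Monte Carlo estimate of closeness centrality on an uncertain graph.
# Each edge exists independently with the given probability; harmonic
# closeness is used here so disconnected sampled worlds are well defined
# (an assumption for illustration, not necessarily the paper's definition).
import random
from collections import deque

# Uncertain graph: (u, v, existence probability)
uncertain_edges = [("a", "b", 0.9), ("b", "c", 0.5),
                   ("a", "c", 0.3), ("c", "d", 0.8)]
vertices = {"a", "b", "c", "d"}

def sample_world(edges, rng):
    """Materialise one possible world by flipping a biased coin per edge."""
    adj = {v: [] for v in vertices}
    for u, v, p in edges:
        if rng.random() < p:
            adj[u].append(v)
            adj[v].append(u)
    return adj

def harmonic_closeness(adj, source):
    """BFS from source; sum of reciprocal distances to reached vertices."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return sum(1.0 / d for v, d in dist.items() if v != source)

def estimate_closeness(source, samples=10_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += harmonic_closeness(sample_world(uncertain_edges, rng), source)
    return total / samples

print("estimated closeness of 'a':", estimate_closeness("a"))
```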

{"title":"Closeness Centrality on Uncertain Graphs","authors":"Zhenfang Liu, Jianxiong Ye, Zhaonian Zou","doi":"https://dl.acm.org/doi/10.1145/3604912","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3604912","url":null,"abstract":"<p>Centrality is a family of metrics for characterizing the importance of a vertex in a graph. Although a large number of centrality metrics have been proposed, a majority of them ignores uncertainty in graph data. In this article, we formulate closeness centrality on uncertain graphs and define the batch closeness centrality evaluation problem that computes the closeness centrality of a subset of vertices in an uncertain graph. We develop three algorithms, <sans-serif>MS-BCC</sans-serif>, <sans-serif>MG-BCC,</sans-serif> and <sans-serif>MGMS-BCC</sans-serif>, based on sampling to approximate the closeness centrality of the specified vertices. All these algorithms require to perform breadth-first searches (BFS) starting from the specified vertices on a large number of sampled possible worlds of the uncertain graph. To improve the efficiency of the algorithms, we exploit operation-level parallelism of the BFS traversals and simultaneously execute the shared sequences of operations in the breadth-first searches. Parallelization is realized at different levels in these algorithms. The experimental results show that the proposed algorithms can efficiently and accurately approximate the closeness centrality of the given vertices. <sans-serif>MGMS-BCC</sans-serif> is faster than both <sans-serif>MS-BCC</sans-serif> and <sans-serif>MG-BCC</sans-serif> because it avoids more repeated executions of the shared operation sequences in the BFS traversals.</p>","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":"113 1","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138516924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Into the Unknown: Exploration of Search Engines’ Responses to Users with Depression and Anxiety
IF 3.5 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-07-11 · DOI: 10.1145/3580283
Ashlee Milton, Maria Soledad Pera

Researchers worldwide have explored the behavioral nuances that emerge from interactions of individuals afflicted by mental health disorders (MHD) with persuasive technologies, mainly social media. Yet, there is a gap in the analysis pertaining to a persuasive technology that is part of their everyday lives: web search engines (SE). Each day, users with MHD embark on information-seeking journeys using popular SE, such as Google or Bing. Every step of the search process, for better or worse, has the potential to influence a searcher’s mindset. In this work, we empirically investigate what subliminal stimuli SE present to these vulnerable individuals during their searches. For this, we use synthetic queries to produce associated query suggestions and search engine results pages. We then infer the subliminal stimuli present in text from SE, i.e., query suggestions, snippets, and web resources. Findings from our empirical analysis reveal that the subliminal stimuli displayed by SE at different stages of the information-seeking process differ between MHD searchers and our control group composed of “average” SE users. Outcomes from this work showcase open problems related to query suggestions, search engine results pages, and ranking that the information retrieval community needs to address so that SE can better support individuals with MHD.

{"title":"Into the Unknown: Exploration of Search Engines’ Responses to Users with Depression and Anxiety","authors":"Ashlee Milton, Maria Soledad Pera","doi":"https://dl.acm.org/doi/10.1145/3580283","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3580283","url":null,"abstract":"<p>Researchers worldwide have explored the behavioral nuances that emerge from interactions of individuals afflicted by mental health disorders (MHD) with persuasive technologies, mainly social media. Yet, there is a gap in the analysis pertaining to a persuasive technology that is part of their everyday lives: web search engines (SE). Each day, users with MHD embark on information seeking journeys using popular SE, like Google or Bing. Every step of the search process for better or worse has the potential to influence a searcher’s mindset. In this work, we empirically investigate what subliminal stimulus SE present to these vulnerable individuals during their searches. For this, we use synthetic queries to produce associated query suggestions and search engine results pages. Then we infer the subliminal stimulus present in text from SE, i.e., query suggestions, snippets, and web resources. Findings from our empirical analysis reveal that the subliminal stimulus displayed by SE at different stages of the information seeking process differ between MHD searchers and our control group composed of “average” SE users. Outcomes from this work showcase open problems related to query suggestions, search engine result pages, and ranking that the information retrieval community needs to address so that SE can better support individuals with MHD.</p>","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":"43 5","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138495104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Novel Review Helpfulness Measure Based on the User-Review-Item Paradigm
IF 3.5 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-07-11 · DOI: 10.1145/3585280
Luca Pajola, Dongkai Chen, Mauro Conti, V.S. Subrahmanian

Review platforms are viral online services where users share and read opinions about products (e.g., a smartphone) or experiences (e.g., a meal at a restaurant). Other users may be influenced by such opinions when deciding what to buy. The usability of review platforms is currently limited by the massive number of opinions on many products. Therefore, showing only the most helpful reviews for each product is in the best interest of both users and the platform (e.g., Amazon). The current state of the art is far from accurate in predicting how helpful a review is. First, most existing works lack compelling comparisons as many studies are conducted on datasets that are not publicly available. As a consequence, new studies are not always built on top of prior baselines. Second, most existing research focuses only on features derived from the review text, ignoring other fundamental aspects of the review platforms (e.g., the other reviews of a product, the order in which they were submitted).

In this article, we first carefully review the most relevant works in the area published during the last 20 years. We then propose the User-Review-Item (URI) paradigm, a novel abstraction for modeling the problem that moves the focus of the feature engineering from the review to the platform level. We empirically validate the URI paradigm on a dataset of products from six Amazon categories with 270 trained models: on average, classifiers gain +4% in F1-score when considering the whole review platform context. In our experiments, we further emphasize some problems with the helpfulness prediction task: (1) the users’ writing style changes over time (i.e., concept drift), (2) past models do not generalize well across different review categories, and (3) past methods to generate the ground truth produced unreliable helpfulness scores, affecting the model evaluation phase.
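
To make the review-level versus platform-level distinction concrete, the sketch below contrasts features computed from a single review's text with features that also look at the item's other reviews (submission rank, length relative to the item average). The specific features, class names, and example data are illustrative assumptions, not the paper's URI feature set.

```python
# Illustrative contrast between review-level and platform-level (URI-style)
# features; the concrete features chosen here are assumptions, not the
# feature set used in the paper.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Review:
    item_id: str
    user_id: str
    text: str
    timestamp: int      # e.g. Unix time of submission

def review_level_features(review: Review) -> dict:
    """Features that only look at the review itself."""
    words = review.text.split()
    return {"n_words": len(words),
            "n_exclamations": review.text.count("!")}

def platform_level_features(review: Review, all_reviews: list[Review]) -> dict:
    """Features that also consider the item's other reviews."""
    siblings = [r for r in all_reviews if r.item_id == review.item_id]
    siblings.sort(key=lambda r: r.timestamp)
    avg_len = mean(len(r.text.split()) for r in siblings)
    return {"submission_rank": siblings.index(review),        # 0 = earliest
            "n_reviews_for_item": len(siblings),
            "relative_length": len(review.text.split()) / avg_len}

reviews = [
    Review("phone-1", "u1", "Great battery life!", 100),
    Review("phone-1", "u2", "Screen cracked after a week, support was slow.", 200),
    Review("phone-1", "u3", "Okay.", 300),
]
target = reviews[1]
print(review_level_features(target))
print(platform_level_features(target, reviews))
```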

{"title":"A Novel Review Helpfulness Measure Based on the User-Review-Item Paradigm","authors":"Luca Pajola, Dongkai Chen, Mauro Conti, V.S. Subrahmanian","doi":"https://dl.acm.org/doi/10.1145/3585280","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3585280","url":null,"abstract":"<p>Review platforms are viral online services where users share and read opinions about products (e.g., a smartphone) or experiences (e.g., a meal at a restaurant). Other users may be influenced by such opinions when deciding what to buy. The usability of review platforms is currently limited by the massive number of opinions on many products. Therefore, showing only the most <i>helpful</i> reviews for each product is in the best interest of both users and the platform (e.g., Amazon). The current state of the art is far from accurate in predicting how helpful a review is. First, most existing works lack compelling comparisons as many studies are conducted on datasets that are not publicly available. As a consequence, new studies are not always built on top of prior baselines. Second, most existing research focuses only on features derived from the review text, ignoring other fundamental aspects of the review platforms (e.g., the other reviews of a product, the order in which they were submitted).</p><p>In this article, we first carefully review the most relevant works in the area published during the last 20 years. We then propose the User-Review-Item (URI) paradigm, a novel abstraction for modeling the problem that moves the focus of the feature engineering from the review to the platform level. We empirically validate the URI paradigm on a dataset of products from six Amazon categories with 270 trained models: on average, classifiers gain +4% in F1-score when considering the whole review platform context. In our experiments, we further emphasize some problems with the helpfulness prediction task: (1) the users’ writing style changes over time (i.e., concept drift), (2) past models do not generalize well across different review categories, and (3) past methods to generate the ground truth produced unreliable helpfulness scores, affecting the model evaluation phase.</p>","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":"43 7","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138495102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Reverse Maximum Inner Product Search: Formulation, Algorithms, and Analysis
IF 3.5 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-07-11 · DOI: 10.1145/3587215
Daichi Amagata, Takahiro Hara

The maximum inner product search (MIPS), which finds the item with the highest inner product with a given query user, is an essential problem in the recommendation field. E-commerce companies often face situations where they want to promote and sell new or discounted items. In these situations, we have to consider the following questions: Who is interested in the items, and how do we find them? This article answers this question by addressing a new problem called reverse maximum inner product search (reverse MIPS). Given a query vector and two sets of vectors (user vectors and item vectors), the reverse MIPS problem finds the set of user vectors whose inner product with the query vector is the maximum among the query and item vectors. Although the importance of this problem is clear, a straightforward implementation is computationally expensive.

We therefore propose Simpfer, a simple, fast, and exact algorithm for reverse MIPS. In an offline phase, Simpfer builds a simple index that maintains a lower bound of the maximum inner product for each user vector. By exploiting this index, Simpfer judges in constant time whether the query vector can have the maximum inner product for a given user vector. Our index also enables filtering, in a batch, user vectors that cannot have the maximum inner product with the query vector. We theoretically demonstrate that Simpfer outperforms baselines employing state-of-the-art MIPS techniques. In addition, we answer two new research questions. Can approximation algorithms further improve reverse MIPS processing? Is there an exact algorithm that is faster than Simpfer? For the former, we show that approximation with a quality guarantee provides only a small speed-up. For the latter, we propose Simpfer++, a theoretically and practically faster algorithm than Simpfer. Our extensive experiments on real datasets show that Simpfer is at least two orders of magnitude faster than the baselines, and Simpfer++ further improves the online processing time.
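
The pruning idea is easy to picture: precompute, for every user vector, a lower bound on that user's maximum inner product over the items; at query time, any user whose bound already exceeds the user-query inner product cannot have the query as their maximum and can be skipped. The sketch below caches the best score over a random item subset as the lower bound and falls back to exact verification for the survivors; this is a simplification for illustration, not Simpfer's actual index structure or its batch filtering.

```python
# Sketch of lower-bound filtering for reverse MIPS. For each user we cache
# the best inner product over a small subset of items as a lower bound on
# that user's true maximum; at query time users whose cached bound already
# beats <user, query> are pruned without touching the full item set.
# This is a simplification of Simpfer's index, not its actual structure.
import numpy as np

rng = np.random.default_rng(0)
d, n_users, n_items = 16, 1000, 5000
users = rng.normal(size=(n_users, d))
items = rng.normal(size=(n_items, d))

# Offline: lower bound from a fixed random subset of items.
cache_ids = rng.choice(n_items, size=64, replace=False)
lower_bound = (users @ items[cache_ids].T).max(axis=1)     # (n_users,)

def reverse_mips(query):
    q_scores = users @ query                                # (n_users,)
    answers = []
    for u in np.flatnonzero(q_scores >= lower_bound):       # survivors only
        # Exact verification against every item for the remaining users.
        if q_scores[u] >= (users[u] @ items.T).max():
            answers.append(int(u))
    return answers

query = rng.normal(size=d)
print("users for whom the query is their MIP:", reverse_mips(query))
```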

{"title":"Reverse Maximum Inner Product Search: Formulation, Algorithms, and Analysis","authors":"Daichi Amagata, Takahiro Hara","doi":"https://dl.acm.org/doi/10.1145/3587215","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3587215","url":null,"abstract":"<p>The maximum inner product search (MIPS), which finds the item with the highest inner product with a given query user, is an essential problem in the recommendation field. Usually e-commerce companies face situations where they want to promote and sell new or discounted items. In these situations, we have to consider the following questions: Who is interested in the items, and how do we find them? This article answers this question by addressing a new problem called reverse maximum inner product search (reverse MIPS). Given a query vector and two sets of vectors (user vectors and item vectors), the problem of reverse MIPS finds a set of user vectors whose inner product with the query vector is the maximum among the query and item vectors. Although the importance of this problem is clear, its straightforward implementation incurs a computationally expensive cost.</p><p>We therefore propose Simpfer, a simple, fast, and exact algorithm for reverse MIPS. In an offline phase, Simpfer builds a simple index that maintains a lower bound of the maximum inner product. By exploiting this index, Simpfer judges whether the query vector can have the maximum inner product or not, for a given user vector, in a constant time. Our index enables filtering user vectors, which cannot have the maximum inner product with the query vector, in a batch. We theoretically demonstrate that Simpfer outperforms baselines employing state-of-the-art MIPS techniques. In addition, we answer two new research questions. Can approximation algorithms further improve reverse MIPS processing? Is there an exact algorithm that is faster than Simpfer? For the former, we show that approximation with quality guarantee provides a little speed-up. For the latter, we propose Simpfer++, a theoretically and practically faster algorithm than Simpfer. Our extensive experiments on real datasets show that Simpfer is at least two orders of magnitude faster than the baselines, and Simpfer++ further improves the online processing time.</p>","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":"43 6","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138495103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages
IF 3.5 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-07-11 · DOI: 10.1145/3589206
John Berlin, Mat Kelly, Michael L. Nelson, Michele C. Weigle

When replaying an archived web page, or memento, the fundamental expectation is that the page should be viewable and function exactly as it did at the archival time. However, this expectation requires web archives, upon replay, to modify the page and its embedded resources so that all resources and links reference the archive rather than the original server. Although these modifications necessarily change the state of the representation, it is understood that without them the replay of mementos from the archive would not be possible. The process of replaying mementos and the modifications web archives make to the representations vary between archives. Because of this, there is no standard terminology for describing the replay process and the needed modifications. In this article, we propose terminology for describing the existing styles of replay and the modifications web archives make to mementos to facilitate replay. Because of issues discovered with server-side-only modifications, we propose a general framework for the auto-generation of client-side rewriting libraries. Finally, we evaluate the effectiveness of using a generated client-side rewriting library to augment the existing replay systems of web archives by crawling mementos replayed from the Internet Archive’s Wayback Machine with and without the generated client-side rewriter. By using the generated client-side rewriter, we were able to decrease the cumulative number of requests blocked by the content security policy of the Wayback Machine for 577 mementos by 87.5% and to increase the cumulative number of requests made by 32.8%. We were also able to replay mementos that were previously not replayable from the Internet Archive. Many of the client-side rewriting ideas described in this work have been implemented in Wombat, a client-side URL rewriting system that is used by the Webrecorder, Pywb, and Wayback Machine playback systems.
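
As a minimal illustration of the kind of rewriting replay requires, the toy function below points absolute links in an HTML fragment at an archive under an assumed Wayback-style /web/&lt;timestamp&gt;/&lt;original-url&gt; prefix. The archive host is hypothetical, and real rewriters such as Pywb or Wombat handle far more cases (relative URLs, JavaScript, CSS, srcset, headers) than this single regular expression does.

```python
# Toy server-side rewriter: point absolute http(s) links at an archive
# replay prefix. The /web/<timestamp>/<url> scheme is an assumption
# mirroring Wayback-style URLs; production rewriters (e.g. Pywb, Wombat)
# also handle relative URLs, JavaScript, CSS, and much more.
import re

ARCHIVE_PREFIX = "https://archive.example.org/web"   # hypothetical archive

# Match href/src attributes whose value is an absolute http(s) URL.
LINK_PATTERN = re.compile(r"(href|src)=([\"'])(https?://[^\"']+)\2")

def rewrite_html(html: str, timestamp: str) -> str:
    def repl(match: re.Match) -> str:
        attr, quote, url = match.group(1), match.group(2), match.group(3)
        return f"{attr}={quote}{ARCHIVE_PREFIX}/{timestamp}/{url}{quote}"
    return LINK_PATTERN.sub(repl, html)

page = '<a href="https://example.com/a">a</a> <img src="http://example.com/i.png">'
print(rewrite_html(page, "20230711000000"))
```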

{"title":"To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages","authors":"John Berlin, Mat Kelly, Michael L. Nelson, Michele C. Weigle","doi":"https://dl.acm.org/doi/10.1145/3589206","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3589206","url":null,"abstract":"<p>When replaying an archived web page, or <i>memento</i>, the fundamental expectation is that the page should be viewable and function exactly as it did at the archival time. However, this expectation requires web archives upon replay to modify the page and its embedded resources so that all resources and links reference the archive rather than the original server. Although these modifications necessarily change the state of the representation, it is understood that without them the replay of mementos from the archive would not be possible. The process of replaying mementos and the modifications made to the representations by web archives varies between archives. Because of this, there is no standard terminology for describing the replay and needed modifications. In this article, we propose terminology for describing the existing styles of replay and the modifications made on the part of web archives to mementos to facilitate replay. Because of issues discovered with server-side only modifications, we propose a general framework for the auto-generation of client-side rewriting libraries. Finally, we evaluate the effectiveness of using a generated client-side rewriting library to augment the existing replay systems of web archives by crawling mementos replayed from the Internet Archive’s Wayback Machine with and without the generated client-side rewriter. By using the generated client-side rewriter, we were able to decrease the cumulative number of requests blocked by the content security policy of the Wayback Machine for 577 mementos by 87.5% and increased the cumulative number of requests made by 32.8%. We were also able to replay mementos that were previously not replayable from the Internet Archive. Many of the client-side rewriting ideas described in this work have been implemented into Wombat, a client-side URL rewriting system that is used by the Webrecorder, Pywb, and Wayback Machine playback systems.</p>","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":"43 8","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138495101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Summarizing Web Archive Corpora Via Social Media Storytelling By Automatically Selecting and Visualizing Exemplars
IF 3.5 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-07-03 · DOI: 10.1145/3606030
Shawn M. Jones, Martin Klein, Michele C. Weigle, Michael L. Nelson

People often create themed collections to make sense of an ever-increasing number of archived web pages. Some of these collections contain hundreds of thousands of documents. Thousands of collections exist, many covering the same topic. Few collections include standardized metadata. This scale makes understanding a collection an expensive proposition. Our Dark and Stormy Archives (DSA) five-process model implements a novel summarization method to help users understand a collection by combining web archives and social media storytelling. The five processes of the DSA model are: select exemplars, generate story metadata, generate document metadata, visualize the story, and distribute the story. Selecting exemplars produces a set of k documents from the N documents in the collection, where k ≪ N, thus reducing the number of documents visitors need to review to understand a collection. Generating story and document metadata selects images, titles, descriptions, and other content from these exemplars. Visualizing the story ties this metadata together in a format the visitor can consume. Without distributing the story, it is not shared for others to consume. We present a research study demonstrating that our algorithmic primitives can be combined to select relevant exemplars that are otherwise undiscoverable using a conventional search engine and query generation methods. Having demonstrated improved methods for selecting exemplars, we visualize the story. Previous work established that the social card is the best format for visitors to consume surrogates. The social card combines metadata fields, including the document’s title, a brief description, and a striking image. Social cards are commonly found on social media platforms. We discovered that these platforms perform poorly for mementos and rely on web page authors to supply the necessary values for these metadata fields. With web archives, we often encounter archived web pages that predate the existence of this metadata. To generate this missing metadata and ensure that storytelling is available for these documents, we apply machine learning to generate the images needed for social cards with a [email protected] of 0.8314. We also provide the length values needed for executing automatic summarization algorithms to generate document descriptions. Applying these concepts helps us create the visualizations needed to fulfill the final processes of story generation. We close this work with examples and applications of this technology.
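
Exemplar selection, reducing N documents to k representatives, can be approximated in many ways. Purely as an illustrative baseline (not the DSA model's selection algorithm), the sketch below ranks documents by similarity to the TF-IDF centroid of the collection and skips near-duplicates; the threshold and the toy corpus are assumptions.

```python
# Illustrative exemplar selection: pick k documents closest to the TF-IDF
# centroid while skipping near-duplicates. This is a generic baseline for
# "select exemplars", not the DSA model's selection algorithm.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_exemplars(documents, k=2, dedup_threshold=0.9):
    tfidf = TfidfVectorizer().fit_transform(documents)        # (N, vocab)
    centroid = np.asarray(tfidf.mean(axis=0))                 # (1, vocab)
    order = cosine_similarity(tfidf, centroid).ravel().argsort()[::-1]
    chosen = []
    for idx in order:
        if any(cosine_similarity(tfidf[idx], tfidf[j])[0, 0] > dedup_threshold
               for j in chosen):
            continue                                           # near-duplicate
        chosen.append(idx)
        if len(chosen) == k:
            break
    return [documents[i] for i in chosen]

docs = ["hurricane damage along the coast", "coastal hurricane damage report",
        "volunteers organise relief shelters", "school reopens after the storm"]
print(select_exemplars(docs, k=2))
```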

{"title":"Summarizing Web Archive Corpora Via Social Media Storytelling By Automatically Selecting and Visualizing Exemplars","authors":"Shawn M. Jones, Martin Klein, Michele C. Weigle, Michael L. Nelson","doi":"https://dl.acm.org/doi/10.1145/3606030","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3606030","url":null,"abstract":"<p>People often create themed collections to make sense of an ever-increasing number of archived web pages. Some of these collections contain hundreds of thousands of documents. Thousands of collections exist, many covering the same topic. Few collections include standardized metadata. This scale makes understanding a collection an expensive proposition. Our Dark and Stormy Archives (DSA) five-process model implements a novel summarization method to help users understand a collection by combining web archives and social media storytelling. The five processes of the DSA model are: select exemplars, generate story metadata, generate document metadata, visualize the story, and distribute the story. Selecting exemplars produces a set of <i>k</i> documents from the <i>N</i> documents in the collection, where <i>k</i> &lt; &lt;<i>N</i>, thus reducing the number of documents visitors need to review to understand a collection. Generating story and document metadata selects images, titles, descriptions, and other content from these exemplars. Visualizing the story ties this metadata together in a format the visitor can consume. Without distributing the story, it is not shared for others to consume. We present a research study demonstrating that our algorithmic primitives can be combined to select relevant exemplars that are otherwise undiscoverable using a conventional search engine and query generation methods. Having demonstrated improved methods for selecting exemplars, we visualize the story. Previous work established that the social card is the best format for visitors to consume surrogates. The social card combines metadata fields, including the document’s title, a brief description, and a striking image. Social cards are commonly found on social media platforms. We discovered that these platforms perform poorly for mementos and rely on web page authors to supply the necessary values for these metadata fields. With web archives, we often encounter archived web pages that predate the existence of this metadata. To generate this missing metadata and ensure that storytelling is available for these documents, we apply machine learning to generate the images needed for social cards with a [email protected] of 0.8314. We also provide the length values needed for executing automatic summarization algorithms to generate document descriptions. Applying these concepts helps us create the visualizations needed to fulfill the final processes of story generation. We close this work with examples and applications of this technology.</p>","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":"43 9","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138495100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0