Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management: Latest Articles

PubMed Author-assigned Keyword Extraction (PubMedAKE) Benchmark.
Jiasheng Sheng, Zelalem Gero, Joyce C Ho

With the ever-increasing abundance of biomedical articles, improving the accuracy of keyword search results becomes crucial for ensuring reproducible research. However, keyword extraction for biomedical articles is hard due to the existence of obscure keywords and the lack of a comprehensive benchmark. PubMedAKE is an author-assigned keyword extraction dataset that contains the title, abstract, and keywords of over 843,269 articles from the PubMed open access subset database. This dataset, publicly available on Zenodo, is the largest keyword extraction benchmark with sufficient samples to train neural networks. Experimental results using state-of-the-art baseline methods illustrate the need for developing automatic keyword extraction methods for biomedical literature.

DOI: https://doi.org/10.1145/3511808.3557675; published 2022-10-01; pp. 4470-4474; open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9652778/pdf/nihms-1846241.pdf
Citations: 1
From Product Searches to Conversational Agents for E-Commerce
G. D. Fabbrizio
DOI: https://doi.org/10.1145/3511808.3557514; published 2022-01-01; p. 5085
Citations: 0
Non-Visual Accessibility Assessment of Videos.
Ali Selman Aydin, Yu-Jung Ko, Utku Uckun, I V Ramakrishnan, Vikas Ashok

Video accessibility is crucial for blind screen-reader users, as online videos increasingly play an essential role in education, employment, and entertainment. While quite a few techniques and guidelines focus on creating accessible videos, there is a dearth of research that attempts to characterize the accessibility of existing videos. Therefore, in this paper, we define and investigate a diverse set of video- and audio-based accessibility features in an effort to characterize accessible and inaccessible videos. As a ground truth for our investigation, we built a custom dataset of 600 videos, in which each video was assigned an accessibility score based on the number of its wins in a Swiss-system tournament, where human annotators performed pairwise accessibility comparisons of videos. In contrast to existing accessibility research, where the assessments are typically done by blind users, we recruited sighted users for our effort, since videos comprise a special case where sight could be required to better judge whether any particular scene in a video is presently accessible. Subsequently, by examining the extent of association between the accessibility features and the accessibility scores, we could determine the features that significantly (positively or negatively) impact video accessibility and therefore serve as good indicators for assessing the accessibility of videos. Using the custom dataset, we also trained machine learning models that leveraged our handcrafted features to either classify an arbitrary video as accessible/inaccessible or predict an accessibility score for the video. Evaluation of our models yielded an F1 score of 0.675 for binary classification and a mean absolute error of 0.53 for score prediction, thereby demonstrating their potential in video accessibility assessment while also illuminating their current limitations and the need for further research in this area.
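
The scoring idea in this abstract (an accessibility score equal to the number of pairwise wins in a Swiss-system tournament) can be illustrated with a minimal sketch. The pairing rule (sort by current wins, pair neighbours), the number of rounds, and the `prefer` callback standing in for a human annotator are all assumptions made for illustration; the abstract does not describe the authors' exact protocol, and this sketch does not prevent rematches as a full Swiss system would.

```python
import random
from typing import Callable, Dict, List


def swiss_scores(
    items: List[str],
    prefer: Callable[[str, str], str],
    rounds: int,
    seed: int = 0,
) -> Dict[str, int]:
    """Return the number of wins per item after `rounds` Swiss-style rounds.

    `prefer(a, b)` is a hypothetical stand-in for a human annotator: it
    returns whichever of the two items is judged more accessible.
    """
    rng = random.Random(seed)
    wins = {item: 0 for item in items}
    for _ in range(rounds):
        # Shuffle first so ties are broken randomly, then sort by current wins
        # so that items with similar records are paired with each other.
        order = items[:]
        rng.shuffle(order)
        order.sort(key=lambda item: wins[item], reverse=True)
        # Pair neighbours; with an odd count the last item sits the round out.
        for a, b in zip(order[0::2], order[1::2]):
            wins[prefer(a, b)] += 1
    return wins


# Toy usage: a hidden "true" accessibility value decides each comparison.
truth = {"v1": 0.9, "v2": 0.4, "v3": 0.7, "v4": 0.1}
scores = swiss_scores(
    list(truth), lambda a, b: a if truth[a] >= truth[b] else b, rounds=3
)
```

Each round costs only half as many comparisons as there are items, which is the practical appeal of a Swiss system over a full round-robin when every comparison needs a human judgment.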

DOI: https://doi.org/10.1145/3459637.3482457; published 2021-10-01; pp. 58-67; open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8845074/pdf/nihms-1777380.pdf
Citations: 1
Temporal Network Embedding via Tensor Factorization.
Jing Ma, Qiuchen Zhang, Jian Lou, Li Xiong, Joyce C Ho

Representation learning on static graph-structured data has shown a significant impact on many real-world applications. However, less attention has been paid to the evolving nature of temporal networks, in which the edges are often changing over time. The embeddings of such temporal networks should encode both graph-structured information and the temporally evolving pattern. Existing approaches in learning temporally evolving network representations fail to capture the temporal interdependence. In this paper, we propose Toffee, a novel approach for temporal network representation learning based on tensor decomposition. Our method exploits the tensor-tensor product operator to encode the cross-time information, so that the periodic changes in the evolving networks can be captured. Experimental results demonstrate that Toffee outperforms existing methods on multiple real-world temporal networks in generating effective embeddings for the link prediction tasks.
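
The tensor-tensor product mentioned in this abstract can be sketched in NumPy. The version below follows the standard definition (circular convolution of frontal slices along the third mode, computed through the FFT); treating the third mode as the time axis of a temporal network is an assumption here, and the sketch shows only the operator, not the Toffee method itself.

```python
import numpy as np


def t_product(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Tensor-tensor product of A (n1 x n2 x n3) and B (n2 x n4 x n3)."""
    if A.shape[1] != B.shape[0] or A.shape[2] != B.shape[2]:
        raise ValueError("incompatible shapes for the t-product")
    # The t-product is a circular convolution along the third mode, which
    # becomes an ordinary matrix product per frontal slice in Fourier space.
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.fft.fft(B, axis=2)
    C_hat = np.einsum("ijk,jlk->ilk", A_hat, B_hat)
    return np.real(np.fft.ifft(C_hat, axis=2))


def t_product_direct(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Reference version: explicit circular convolution of frontal slices."""
    n3 = A.shape[2]
    C = np.zeros((A.shape[0], B.shape[1], n3))
    for k in range(n3):
        for s in range(n3):
            C[:, :, k] += A[:, :, s] @ B[:, :, (k - s) % n3]
    return C


rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 5))
B = rng.standard_normal((3, 2, 5))
C = t_product(A, B)
```

Because the convolution along the third mode is circular, every slice of the result mixes information from all slices of the inputs, which is one way to read the abstract's remark about encoding cross-time information and periodic changes.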

DOI: https://doi.org/10.1145/3459637.3482200; published 2021-10-01; pp. 3313-3317; open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9652776/pdf/nihms-1846391.pdf
Citations: 0
Subsampled Randomized Hadamard Transform for Regression of Dynamic Graphs
M. H. Chehreghani
A well-known problem in data science and machine learning is linear regression, which has recently been extended to dynamic graphs. Existing exact algorithms for updating solutions of dynamic graph regression require at least linear time (in terms of n, the number of nodes of the graph). However, this time complexity might be intractable in practice. In this paper, we utilize the subsampled randomized Hadamard transform to propose a randomized algorithm for dynamic graphs. Suppose that we are given an n × m matrix embedding M of the graph, where m ≪ n. Let r be the number of samples required for a guaranteed approximation error, which is a sublinear function of n. After an edge insertion or an edge deletion in the graph, our algorithm updates the approximate solution in O(rm) time.
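
As a rough illustration of the subsampled randomized Hadamard transform itself, the sketch below compresses a static least-squares problem from n rows to r rows and solves the smaller problem. It is a plain sketch-and-solve example under stated assumptions (n is a power of two, rows are sampled uniformly without replacement, and the data are synthetic); it does not implement the paper's dynamic update after edge insertions or deletions.

```python
import numpy as np


def fwht(X: np.ndarray) -> np.ndarray:
    """Unnormalised fast Walsh-Hadamard transform applied to each column.

    The number of rows must be a power of two.
    """
    X = X.astype(float)
    n = X.shape[0]
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("row count must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            top = X[start:start + h].copy()
            bottom = X[start + h:start + 2 * h].copy()
            X[start:start + h] = top + bottom
            X[start + h:start + 2 * h] = top - bottom
        h *= 2
    return X


def srht_sketch(M, b, r, rng):
    """Apply S = sqrt(n/r) * R * H * D to M and b, with H orthonormal."""
    n = M.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)        # D: random sign flips
    rows = rng.choice(n, size=r, replace=False)    # R: uniform row sample
    scale = np.sqrt(n / r) / np.sqrt(n)            # sqrt(n/r) times H's 1/sqrt(n)
    SM = scale * fwht(signs[:, None] * M)[rows]
    Sb = scale * fwht(signs[:, None] * b[:, None])[rows, 0]
    return SM, Sb


rng = np.random.default_rng(0)
n, m, r = 1024, 5, 128
M = rng.standard_normal((n, m))
x_true = np.arange(1.0, m + 1.0)
b = M @ x_true                      # noiseless targets, so the fit is exact
SM, Sb = srht_sketch(M, b, r, rng)
x_sketch, *_ = np.linalg.lstsq(SM, Sb, rcond=None)
```

The sign flips and the Hadamard mixing spread the energy of each row across all rows, which is what makes uniform sampling of only r rows a reasonable approximation of the full problem.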
DOI: https://doi.org/10.1145/3340531.3412158; published 2020-10-19; pp. 2045-2048
Citations: 3
Hierarchical Active Learning with Overlapping Regions.
Zhipeng Luo, Milos Hauskrecht

Learning classification models from real-world data often requires substantial human effort devoted to instance annotation. As this process can be very time-consuming and costly, finding effective ways to reduce the annotation cost becomes critical for building such models. To address this problem, we explore a new type of human feedback: region-based feedback. Briefly, a region is defined as a hypercubic subspace of the input data space and represents a subpopulation of data instances; the region's label is a human assessment of the class proportion of the data subpopulation. By using learning-from-label-proportions algorithms, one can learn instance-based classifiers from such labeled regions. In general, the key challenge is that there can be infinitely many regions one can define and query in a given data space. To minimize the number and complexity of region-based queries, we propose and develop a hierarchical active learning solution that aims at incrementally building a concise hierarchy of regions. Furthermore, to avoid building a possibly class-irrelevant region hierarchy, we propose to grow multiple different hierarchies in parallel and expand the more informative ones. Through experiments on numerous data sets, we demonstrate that methods using region-based feedback can learn very good classifiers from very few and simple queries, and hence are highly effective in reducing the human annotation effort needed for building classification models.
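
The notion of a region and its proportion label can be made concrete with a small sketch. The `Region` class, the `split` helper, and the simulated annotator that reads hidden labels are all hypothetical names introduced for illustration; the abstract does not specify how regions are represented or refined.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Region:
    """Axis-aligned hypercubic region given by per-feature bounds."""

    lower: np.ndarray
    upper: np.ndarray

    def contains(self, X: np.ndarray) -> np.ndarray:
        """Boolean mask of the rows of X that fall inside the region."""
        return np.all((X >= self.lower) & (X <= self.upper), axis=1)

    def split(self, dim: int, threshold: float):
        """Split into two child regions along one feature."""
        left_upper = self.upper.copy()
        left_upper[dim] = threshold
        right_lower = self.lower.copy()
        right_lower[dim] = threshold
        left = Region(self.lower.copy(), left_upper)
        right = Region(right_lower, self.upper.copy())
        return left, right


def proportion_label(region: Region, X: np.ndarray, y_hidden: np.ndarray) -> float:
    """Simulated annotator: fraction of positive instances in the region."""
    mask = region.contains(X)
    return float(y_hidden[mask].mean()) if mask.any() else float("nan")


X = np.array([[0.1, 0.2], [0.4, 0.9], [0.6, 0.3], [0.9, 0.8]])
y_hidden = np.array([0, 0, 1, 1])
root = Region(np.zeros(2), np.ones(2))
left, right = root.split(dim=0, threshold=0.5)
labels = [proportion_label(reg, X, y_hidden) for reg in (root, left, right)]
```

In this toy data the root region is uninformative (half positive), while one split along the first feature yields two pure child regions, which is the kind of refinement a region hierarchy is meant to discover with few queries.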

DOI: https://doi.org/10.1145/3340531.3412022; published 2020-10-01; pp. 1045-1054
Citations: 1
GPU-Accelerated Decoding of Integer Lists
Antonio Mallia, Michal Siedlaczek, Torsten Suel, M. Zahran
An inverted index is the basic data structure used in most current large-scale information retrieval systems. It can be modeled as a collection of sorted sequences of integers. Many compression techniques for inverted indexes have been studied in the past, with some of them reaching tremendous decompression speeds through the use of SIMD instructions available on modern CPUs. While there has been some work on query processing algorithms for Graphics Processing Units (GPUs), little of it has focused on how to efficiently access compressed index structures, and we see some potential for significant improvements in decompression speed. In this paper, we describe and implement two encoding schemes for index decompression on GPU architectures. Their format and decoding algorithms are adapted from existing CPU-based compression methods to exploit the execution model and memory hierarchy offered by GPUs. We show that our solutions, GPU-BP and GPU-VByte, achieve significant speedups over their already carefully optimized CPU counterparts.
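
To make the "sorted sequences of integers" concrete, the sketch below encodes a posting list as gaps between consecutive document IDs and compresses the gaps with a variable-byte (VByte) code. This is a plain sequential CPU version for illustration only; the byte layout (low-order groups first, high bit marking the last byte) is one common convention and an assumption here, and nothing below reflects the GPU-specific format of GPU-VByte.

```python
from typing import Iterable, List


def vbyte_encode(numbers: Iterable[int]) -> bytes:
    """Encode non-negative integers using 7 payload bits per byte.

    Assumed convention: low-order groups first, and the high bit is set
    on the last byte of each integer.
    """
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("VByte encodes non-negative integers only")
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Invert vbyte_encode."""
    numbers, current, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:  # terminator byte of the current integer
            numbers.append(current | ((byte & 0x7F) << shift))
            current, shift = 0, 0
        else:
            current |= byte << shift
            shift += 7
    return numbers


def encode_postings(doc_ids: List[int]) -> bytes:
    """Delta-encode a sorted posting list, then VByte-compress the gaps."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)


def decode_postings(data: bytes) -> List[int]:
    """Decompress the gaps and rebuild document IDs with a prefix sum."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [3, 7, 130, 131, 20000]
encoded = encode_postings(postings)
```

The decoder's byte-by-byte dependency and the final prefix sum are the sequential steps that any parallel decoder has to restructure, which is why the format usually needs adapting for SIMD or GPU execution.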
DOI: https://doi.org/10.1145/3357384.3358067; published 2019-11-03; pp. 2193-2196
Citations: 3
Privacy-Preserving Tensor Factorization for Collaborative Health Data Analysis.
Jing Ma, Qiuchen Zhang, Jian Lou, Joyce C Ho, Li Xiong, Xiaoqian Jiang

Tensor factorization has been demonstrated as an efficient approach for computational phenotyping, where massive electronic health records (EHRs) are converted to concise and meaningful clinical concepts. While distributing the tensor factorization tasks to local sites can avoid direct data sharing, it still requires the exchange of intermediary results, which could reveal sensitive patient information. Therefore, the challenge is how to jointly decompose the tensor under rigorous and principled privacy constraints while still supporting the model's interpretability. We propose DPFact, a privacy-preserving collaborative tensor factorization method for computational phenotyping using EHRs. It embeds advanced privacy-preserving mechanisms with collaborative learning. Hospitals can keep their EHR databases private while collaboratively learning meaningful clinical concepts by sharing differentially private intermediary results. Moreover, DPFact handles heterogeneous patient populations using a structured sparsity term. In our framework, each hospital decomposes its local tensors and, every several iterations, sends the updated intermediary results with output perturbation to a semi-trusted server, which generates the phenotypes. The evaluation on both real-world and synthetic datasets demonstrated that, under strict privacy constraints, our method is more accurate and communication-efficient than state-of-the-art baseline methods.
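
Output perturbation, as named in this abstract, can be sketched generically: bound the magnitude of the quantity to be shared, then add calibrated noise before it leaves the site. The sketch below uses the classic Gaussian mechanism; the row clipping, the function names, and the sensitivity value passed in the usage lines are assumptions for illustration, not DPFact's actual calibration, and a real deployment would need a sensitivity bound derived for its own update rule.

```python
import numpy as np


def clip_rows(F: np.ndarray, max_norm: float) -> np.ndarray:
    """Scale each row so that its L2 norm is at most `max_norm`."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    factors = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return F * factors


def gaussian_perturb(F, sensitivity, epsilon, delta, rng):
    """Add Gaussian noise calibrated with the classic Gaussian mechanism.

    sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon, which is the
    standard calibration for 0 < epsilon < 1.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("requires 0 < epsilon < 1 and 0 < delta < 1")
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return F + rng.normal(0.0, sigma, size=F.shape), sigma


rng = np.random.default_rng(0)
local_factor = rng.standard_normal((6, 3))   # hypothetical local factor matrix
clipped = clip_rows(local_factor, max_norm=1.0)
# The sensitivity value below is a placeholder, not a derived bound.
noisy, sigma = gaussian_perturb(
    clipped, sensitivity=2.0, epsilon=0.5, delta=1e-5, rng=rng
)
```

Sharing only every several iterations, as the abstract describes, reduces how often noise must be added, which matters because each release consumes part of the privacy budget.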

DOI: https://doi.org/10.1145/3357384.3357878; published 2019-11-01; pp. 1291-1300; open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6940039/pdf/nihms-1052726.pdf
Citations: 0
COPA: Constrained PARAFAC2 for Sparse & Large Datasets.
Ardavan Afshar, Ioakeim Perros, Evangelos E Papalexakis, Elizabeth Searles, Joyce Ho, Jimeng Sun

PARAFAC2 has demonstrated success in modeling irregular tensors, where the tensor dimensions vary across one of the modes. An example scenario is modeling treatments across a set of patients with a varying number of medical encounters over time. Despite recent improvements on unconstrained PARAFAC2, its model factors are usually dense and sensitive to noise, which limits their interpretability. As a result, the following open challenges remain: a) various modeling constraints, such as temporal smoothness, sparsity, and non-negativity, need to be imposed for interpretable temporal modeling, and b) a scalable approach is required to support those constraints efficiently for large datasets. To tackle these challenges, we propose a COnstrained PARAFAC2 (COPA) method, which carefully incorporates optimization constraints such as temporal smoothness, sparsity, and non-negativity in the resulting factors. To efficiently support all those constraints, COPA adopts a hybrid optimization framework using alternating optimization and the alternating direction method of multipliers (AO-ADMM). As evaluated on large electronic health record (EHR) datasets with hundreds of thousands of patients, COPA achieves significant speedups (up to 36× faster) over prior PARAFAC2 approaches that only attempt to handle a subset of the constraints that COPA enables. Overall, our method outperforms all the baselines attempting to handle a subset of the constraints in terms of speed, while achieving the same level of accuracy. Through a case study on temporal phenotyping of medically complex children, we demonstrate how the constraints imposed by COPA reveal concise phenotypes and meaningful temporal profiles of patients. The clinical interpretation of both the phenotypes and the temporal profiles was confirmed by a medical expert.
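
The way ADMM accommodates constraints such as non-negativity and sparsity can be shown on a single least-squares subproblem. The sketch below is a generic textbook ADMM loop in which the constraint enters only through a proximal operator; the function names and the toy problem are assumptions, and this is not COPA's AO-ADMM over PARAFAC2 factors.

```python
import numpy as np


def project_nonnegative(v: np.ndarray) -> np.ndarray:
    """Proximal operator of the non-negativity constraint (a projection)."""
    return np.maximum(v, 0.0)


def soft_threshold(v: np.ndarray, lam: float) -> np.ndarray:
    """Proximal operator of lam * ||v||_1, which promotes sparsity."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def admm_constrained_lstsq(A, b, prox, rho=1.0, iters=500):
    """Minimise 0.5 * ||A x - b||^2 + g(x) with ADMM.

    `prox` must be the proximal operator of g / rho. Swapping `prox`
    changes the constraint without touching the rest of the loop.
    """
    n = A.shape[1]
    z = np.zeros(n)
    u = np.zeros(n)
    # The x-update solves the same linear system in every iteration.
    lhs = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(lhs, Atb + rho * (z - u))
        z = prox(x + u)
        u = u + x - z
    return z


A = np.eye(3)
b = np.array([2.0, -1.0, 0.5])
x_nonneg = admm_constrained_lstsq(A, b, project_nonnegative)
x_sparse = admm_constrained_lstsq(A, b, lambda v: soft_threshold(v, 1.0))
```

With an identity design matrix the answers are known in closed form (clipping at zero, and shrinking by the threshold), which makes the toy problem a convenient check that the loop behaves as intended.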

DOI: 10.1145/3269206.3271775. Proceedings of the ... ACM International Conference on Information & Knowledge Management, vol. 2018, pp. 793-802. Published 2018-10-01. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7472553/pdf/nihms-1619557.pdf
Citations: 0
Deep Graph Embedding for Ranking Optimization in E-commerce.
Chen Chu, Zhao Li, Beibei Xin, Fengchao Peng, Chuanren Liu, Remo Rohs, Qiong Luo, Jingren Zhou

Matching buyers with the most suitable sellers providing relevant items (e.g., products) is essential for e-commerce platforms to guarantee customer experience. This matching process is usually achieved by e-commerce ranking systems that model inter-group (buyer-seller) proximity. However, current ranking systems often match buyers with sellers of varying quality, and the mismatch is detrimental not only to buyers' level of satisfaction but also to the platforms' return on investment (ROI). In this paper, we address this problem by incorporating intra-group structural information (e.g., buyer-buyer proximity implied by buyer attributes) into the ranking systems. Specifically, we propose Deep Graph Embedding (DEGREE), a deep learning based method, to exploit both inter-group and intra-group proximities jointly for structural learning. With a sparse filtering technique, DEGREE can significantly improve the matching performance while using fewer computation resources than alternative deep learning based methods. Experimental results demonstrate that DEGREE outperforms state-of-the-art graph embedding methods on real-world e-commerce datasets. In particular, our solution boosts the average unit price in purchases during an online A/B test by up to 11.93%, leading to better operational efficiency and shopping experience.
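To make "buyer-buyer proximity implied by buyer attributes" concrete, the sketch below builds a small, sparse buyer-buyer neighbor list from attribute vectors. It is not DEGREE and not the paper's sparse filtering technique: cosine similarity, the top-k cutoff, the function names, and the toy attribute vectors are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_buyer_neighbors(
    attributes: Dict[str, List[float]], k: int
) -> Dict[str, List[str]]:
    """For each buyer, keep at most k other buyers with positive similarity."""
    neighbors: Dict[str, List[str]] = {}
    for buyer, vector in attributes.items():
        scored = []
        for other, other_vector in attributes.items():
            if other == buyer:
                continue
            score = cosine_similarity(vector, other_vector)
            if score > 0.0:
                scored.append((score, other))
        # Highest similarity first; ties broken by buyer name for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors[buyer] = [other for _, other in scored[:k]]
    return neighbors


# Hypothetical buyer attribute vectors (e.g., one-hot encoded preferences).
buyer_attributes = {
    "buyer_a": [1.0, 0.0, 1.0],
    "buyer_b": [1.0, 0.0, 0.0],
    "buyer_c": [0.0, 1.0, 0.0],
}

intra_group_edges = top_k_buyer_neighbors(buyer_attributes, k=1)
print(intra_group_edges)
```

Keeping only a few neighbors per buyer keeps the intra-group graph sparse, so a downstream embedding model would have far fewer buyer-buyer pairs to process than with a dense similarity matrix.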

DOI: 10.1145/3269206.3272028 (https://doi.org/10.1145/3269206.3272028). Proceedings of the ... ACM International Conference on Information & Knowledge Management, vol. 2018, pp. 2007-2015. Published 2018-10-01.
Citations: 8