Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries: Latest Articles

Using the Business Model Canvas to Support a Risk Assessment Method for Digital Curation

Pub Date : 2015-06-21 DOI: 10.1145/2756406.2756965

D. Proença, A. Nadali, J. Borbinha

This poster presents a pragmatic risk assessment method based on best practice from the ISO 31000 family of risk management standards. The proposed method is supported by established risk management concepts and can help a data repository gain awareness of its risks and of the costs of the controls for those risks. In simple terms, the technique that supports this method is a pragmatic risk registry that can be used to identify risks from an organization's Business Model Canvas. A Business Model Canvas is a model used in strategic management to document existing business models and develop new ones.
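A risk registry keyed to the Business Model Canvas could be sketched roughly as follows. The abstract does not describe the registry's fields, so everything here is an assumption made for illustration: the `Risk` class, the 1 to 5 likelihood and impact scales, the likelihood-times-impact score, and the helper names. Only the nine canvas building blocks are standard.

```python
from dataclasses import dataclass

# The nine standard building blocks of a Business Model Canvas.
CANVAS_BLOCKS = (
    "Customer Segments", "Value Propositions", "Channels",
    "Customer Relationships", "Revenue Streams", "Key Resources",
    "Key Activities", "Key Partnerships", "Cost Structure",
)


@dataclass
class Risk:
    """One registry entry (hypothetical schema, not taken from the poster)."""
    canvas_block: str    # canvas block the risk was identified from
    description: str
    likelihood: int      # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int          # assumed scale: 1 (minor) to 5 (severe)
    control: str         # control proposed for the risk
    control_cost: float  # estimated cost of that control

    def __post_init__(self):
        if self.canvas_block not in CANVAS_BLOCKS:
            raise ValueError(f"unknown canvas block: {self.canvas_block}")

    @property
    def score(self) -> int:
        # Likelihood x impact is a common scoring convention, assumed here.
        return self.likelihood * self.impact


def prioritize(register):
    """Return the risks ordered from highest to lowest score."""
    return sorted(register, key=lambda risk: risk.score, reverse=True)


def total_control_cost(register):
    """Sum the estimated control costs across the registry."""
    return sum(risk.control_cost for risk in register)
```

Tying each entry to a canvas block keeps the registry aligned with the business model, so a repository can see which parts of the model carry the most risk and what the controls would cost.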

Citations: 2

Improving Access to Large-scale Digital Libraries Through Semantic-enhanced Search and Disambiguation

Pub Date : 2015-06-21 DOI: 10.1145/2756406.2756920

A. Hinze, Craig Taube-Schock, D. Bainbridge, Rangi Matamua, J. S. Downie

With 13,000,000 volumes comprising 4.5 billion pages of text, it is currently very difficult for scholars to locate relevant sets of documents that are useful in their research from the HathiTrust Digital Library (HTDL) using traditional lexically based retrieval techniques. Existing document search tools and document clustering approaches use purely lexical analysis, which cannot address the inherent ambiguity of natural language. A semantic search approach offers the potential to overcome the shortcomings of lexical search, but even if an appropriate network of ontologies could be decided upon, it would require a full semantic markup of each document. In this paper, we present a conceptual design and report on the initial implementation of a new framework that affords the benefits of semantic search while minimizing the problems associated with applying existing semantic analysis at scale. Our approach avoids the need for complete semantic document markup using pre-existing ontologies by developing an automatically generated Concept-in-Context (CiC) network seeded by a priori analysis of Wikipedia texts and identification of semantic metadata. Our Capisco system analyzes documents by the semantics and context of their content. The disambiguation of search queries is done interactively, to fully utilize the domain knowledge of the scholar. Our method achieves a form of semantic-enhanced search that simultaneously exploits the proven scale benefits provided by lexical indexing.
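The interplay between a lexical index and interactive sense selection can be illustrated with a toy sketch. This is not the Capisco implementation or its CiC network: the `SENSES` table, the document format, and the function names are all hypothetical, and each document's concepts are assumed to have been assigned beforehand.

```python
from collections import defaultdict

# Hypothetical sense inventory: surface term -> candidate concepts.
SENSES = {"mercury": ["Mercury (planet)", "Mercury (element)"]}


def build_indexes(docs):
    """docs maps doc_id -> (text, concepts already assigned to the document)."""
    lexical = defaultdict(set)   # token -> ids of documents containing it
    concept = defaultdict(set)   # concept -> ids of documents about it
    for doc_id, (text, concepts) in docs.items():
        for token in text.lower().split():
            lexical[token].add(doc_id)
        for name in concepts:
            concept[name].add(doc_id)
    return lexical, concept


def candidate_senses(term):
    """Senses to offer the scholar for an ambiguous query term."""
    return SENSES.get(term.lower(), [])


def search(term, lexical, concept, chosen_sense=None):
    """Lexical lookup first, then narrow by the sense the scholar picked."""
    hits = set(lexical.get(term.lower(), set()))
    if chosen_sense is None:
        return hits
    return hits & concept.get(chosen_sense, set())
```

The lexical lookup does the scalable part of the work, and the sense chosen by the scholar only filters its results, which mirrors the abstract's point about keeping the scale benefits of lexical indexing.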

Citations: 25

Analyzing Tagging Patterns by Integrating Visual Analytics with the Inferential Test

Pub Date : 2015-06-21 DOI: 10.1145/2756406.2756962

Yunseon Choi

Because data is often large and complex, exploring it with visual analytics has become increasingly helpful for interpretation and analysis. The box plot is one such graphical method and is the most common technique for presenting and summarizing statistics. In this paper, we discuss tagging patterns by integrating visual assessment using the box plot with the Shapiro-Wilk test.
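The statistics that a box plot draws can be computed with a few lines of standard-library Python. The sketch below is illustrative only: the function name, the 1.5 x IQR whisker rule, and the "inclusive" quartile method are assumptions, not details from the paper. The Shapiro-Wilk test itself is left out to keep the example dependency-free; SciPy provides it as `scipy.stats.shapiro`.

```python
from statistics import quantiles


def box_plot_summary(values, whisker=1.5):
    """Return the numbers a box plot displays for a sample.

    Assumes the common 1.5 x IQR rule for whiskers and outliers.
    """
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value within the fences
        "whisker_high": inside[-1],  # largest value within the fences
        "outliers": outliers,
    }
```

A long upper whisker or a cluster of high outliers in such a summary hints at a skewed distribution, which is the kind of visual impression a normality test such as Shapiro-Wilk can then check formally.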

Citations: 1

Automatically Generating a Concept Hierarchy with Graphs

Pub Date : 2015-06-21 DOI: 10.1145/2756406.2756967

Pucktada Treeratpituk, Madian Khabsa, C. Lee Giles

We propose a novel graph-based approach for constructing a concept hierarchy from a large text corpus. Our algorithm incorporates both statistical co-occurrences and lexical similarity in optimizing the structure of the taxonomy. To automatically generate topic-dependent taxonomies from a large text corpus, we first extract topical terms and their relationships from the corpus. The algorithm then constructs a weighted graph representing topics and their associations. A graph partitioning algorithm is then used to recursively partition the topic graph into a taxonomy. For evaluation, we apply our approach to articles, primarily in computer science, in the CiteSeerX digital library and search engine.
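The pipeline described above (weighted topic graph, then recursive partitioning) can be sketched in miniature. The abstract gives neither the weighting formula nor the partitioning algorithm, so this example substitutes simple stand-ins that are labeled as assumptions: edge weights mix Jaccard co-occurrence with `difflib` string similarity, and the recursion removes the highest-weighted hub as the parent and splits the rest into connected components. All function names and the `alpha` parameter are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def build_topic_graph(docs, alpha=0.5):
    """docs: list of sets of topical terms. Returns adjacency {a: {b: weight}}."""
    doc_ids = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in terms:
            doc_ids[term].add(doc_id)
    graph = defaultdict(dict)
    for a, b in combinations(sorted(doc_ids), 2):
        both = doc_ids[a] & doc_ids[b]
        if not both:
            continue  # only terms that co-occur get an edge in this sketch
        cooccurrence = len(both) / len(doc_ids[a] | doc_ids[b])  # Jaccard
        lexical = SequenceMatcher(None, a, b).ratio()
        weight = alpha * cooccurrence + (1 - alpha) * lexical
        graph[a][b] = weight
        graph[b][a] = weight
    return graph


def components(nodes, graph):
    """Connected components of the subgraph induced by `nodes`."""
    nodes, seen, result = set(nodes), set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(m for m in graph.get(node, {})
                         if m in nodes and m not in comp)
        seen |= comp
        result.append(comp)
    return result


def build_taxonomy(nodes, graph):
    """Nested dict {parent: {child: {...}}} built by recursive hub removal."""
    taxonomy = {}
    for comp in components(nodes, graph):
        def weighted_degree(node):
            return sum(w for m, w in graph.get(node, {}).items() if m in comp)
        root = max(sorted(comp), key=weighted_degree)
        taxonomy[root] = build_taxonomy(comp - {root}, graph)
    return taxonomy
```

Hub removal is far cruder than a real graph partitioning algorithm, but it shows how a flat weighted graph of topics can be turned recursively into a parent-child hierarchy.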

Citations: 0

Content Analysis of Social Tags Generated by Health Consumers

Pub Date : 2015-06-21 DOI: 10.1145/2756406.2756959

Soohyung Joo, Yunseon Choi

This poster presents preliminary findings of user tag analysis in the domain of consumer health information. To obtain user terms, 36,205 tags from 38 consumer health information sites were collected from delicious.com. Content analysis was applied to identify the dimensions and types of the collected tags. The preliminary findings showed that user-generated tags cover a variety of aspects of health information, ranging from general terms and subject terms to knowledge type and audience. General terms and subject terms dominated, accounting for 31.7% and 22.8% of tags, respectively.

Citations: 1

How Well Are Arabic Websites Archived?

Pub Date : 2015-06-21 DOI: 10.1145/2756406.2756912

Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle

It has long been anecdotally known that web archives and search engines favor Western and English-language sites. In this paper we quantitatively explore how well Arabic-language websites are indexed and archived. We began by sampling 15,092 unique URIs from three different website directories: DMOZ (multi-lingual), Raddadi and Star28 (both primarily Arabic language). Using language identification tools we eliminated pages not in the Arabic language (e.g., English-language versions of Al-Jazeera sites) and culled the collection to 7,976 definitely Arabic-language web pages. We then used these 7,976 pages and crawled the live web and web archives to produce a collection of 300,646 Arabic-language pages. We discovered: 1) 46% are not archived and 31% are not indexed by Google (www.google.com), 2) only 14.84% of the URIs had an Arabic country code top-level domain (e.g., .sa) and only 10.53% had a GeoIP in an Arabic country, 3) having either only an Arabic GeoIP or only an Arabic top-level domain appears to negatively impact archiving, 4) most of the archived pages are near the top level of the site and deeper links into the site are not well-archived, 5) presence in a directory positively impacts indexing, and presence in the DMOZ directory, specifically, positively impacts archiving.
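The country-code check in finding 2 can be approximated with a short sketch. The abstract does not list which ccTLDs were counted as Arabic, so the set below (the ccTLDs of the Arab League member states) is an assumption, as are the function names and the example URIs.

```python
from urllib.parse import urlsplit

# Assumed list: ccTLDs of Arab League member states. The paper's own
# list may differ.
ARABIC_CCTLDS = {
    "ae", "bh", "dj", "dz", "eg", "iq", "jo", "km", "kw", "lb", "ly",
    "ma", "mr", "om", "ps", "qa", "sa", "sd", "so", "sy", "tn", "ye",
}


def has_arabic_cctld(uri):
    """True if the URI's host ends in one of the assumed Arabic ccTLDs.

    The URI must include a scheme (e.g. http://); otherwise urlsplit
    finds no host and the function returns False.
    """
    host = urlsplit(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return tld in ARABIC_CCTLDS


def arabic_cctld_share(uris):
    """Fraction of URIs whose host has an Arabic ccTLD."""
    uris = list(uris)
    if not uris:
        return 0.0
    return sum(has_arabic_cctld(u) for u in uris) / len(uris)
```

A check of this kind only inspects the hostname. The GeoIP figure in the same finding requires resolving each host and looking its address up in a geolocation database, which this sketch does not attempt.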

Citations: 13

Moving the Needle: From Innovation to Impact

Pub Date : 2015-06-21 DOI: 10.1145/2756406.2756408

Katherine Skinner

The digital library has been on a seemingly insatiable quest for “innovation” for decades. This focus permeates our field, usually in the guise of transforming digital library practices. The themes change over time (e.g., Federating library collections! Digital humanities! Digital preservation! Big data!), but dependably, digital library research projects on “innovation” topics are seeded in abundance each year. Researchers are rewarded (and funded) for their big, experimental ideas, not for successful applications of innovations in practice. Gearing resources toward “innovation” alone prizes the unique or novel approach above the cultivation of our field. Few innovations ever flower and thrive beyond their initial moments in the sun. What might happen if digital libraries shift their focus from the innovative solution to the process of using innovations within networks to actively facilitate system-wide change? Drawing from the disciplines of sociology and economics, Skinner will explore both established and emergent models for system-wide transformation, ultimately asking what digital libraries could accomplish as a field if we shifted our focus from “innovation” to “impact.”

Citations: 0

Grading Degradation in an Institutionally Managed Repository

Pub Date : 2015-06-21 DOI: 10.1145/2756406.2756966

Luis Meneses, S. Jayarathna, R. Furuta, F. Shipman

It is not unusual for digital collections to degrade and suffer from problems associated with unexpected change. In an analysis of the ACM conference list, we found that categorizing the degree of change affecting a digital collection over time is a difficult task. More specifically, we found that categorizing this degree of change is not a binary problem where documents are either unchanged or they have changed so dramatically that they do not fit within the scope of the collection. It is, in part, a characterization of the intent of the change. In this work, we examine and categorize the various degrees of change that digital documents endure within the boundaries of an institutionally managed repository.

Citations: 2

Online Person Name Disambiguation with Constraints

Pub Date : 2015-06-21 DOI: 10.1145/2756406.2756915

Madian Khabsa, Pucktada Treeratpituk, C. Lee Giles

While many clustering techniques have been successfully applied to the person name disambiguation problem, most do not address two main practical issues: allowing constraints to be added to the clustering process, and allowing the data to be added incrementally without clustering the entire database. Constraints can be particularly useful in a system such as a digital library, where users are allowed to make corrections to the disambiguated result. For example, a user correction to a disambiguation result specifying that a record does not belong to an author could be kept as a cannot-link constraint to be used in any future disambiguation (such as when new documents are added). Besides such user corrections, constraints also allow background heuristics to be encoded into the disambiguation process. We propose a constraint-based clustering algorithm for person name disambiguation, based on DBSCAN combined with a pairwise distance based on random forests. We further propose an extension to the density-based clustering algorithm (DBSCAN) to handle online clustering so that the disambiguation process can be done iteratively as new data points are added. Our algorithm utilizes similarity features based on both metadata information and citation similarity. We implement two types of clustering constraints to demonstrate the concept. Experiments on the CiteSeer data show that our model can achieve 0.95 pairwise F1 and 0.79 cluster F1. The presence of constraints also consistently improves the disambiguation result across different combinations of features.
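How a cannot-link constraint can steer density-based clustering is easiest to see in a small sketch. This is a simplified, batch illustration and not the authors' algorithm: it omits the online extension and the random-forest distance, takes an arbitrary `dist` function instead, and the function name and constraint encoding (a set of `frozenset` index pairs) are assumptions.

```python
from collections import deque


def constrained_dbscan(points, dist, eps, min_pts, cannot_link=frozenset()):
    """DBSCAN variant that refuses to merge points joined by a cannot-link.

    cannot_link holds frozenset({i, j}) pairs of point indices that must
    not share a cluster. Returns one label per point; -1 marks noise.
    """
    n = len(points)
    labels = {}

    def neighbors(i):
        # Includes i itself, as in the usual DBSCAN definition.
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    def violates(j, members):
        return any(frozenset((j, m)) in cannot_link for m in members)

    cluster_id = 0
    for i in range(n):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may be absorbed later
            continue
        labels[i] = cluster_id
        members = {i}
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if j in members or labels.get(j, -1) != -1:
                continue  # already placed in this or another cluster
            if violates(j, members):
                continue  # constraint blocks j from joining this cluster
            labels[j] = cluster_id
            members.add(j)
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster_id += 1
    return [labels[i] for i in range(n)]
```

A point that is blocked by a constraint stays unlabeled and is revisited by the outer loop, where it may seed a cluster of its own. This is how a stored user correction ("this record is not by that author") can split what would otherwise be a single cluster.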

Citations: 39

Topic Modeling Users' Interpretations of Songs to Inform Subject Access in Music Digital Libraries

Pub Date : 2015-06-21 DOI: 10.1145/2756406.2756936

Kahyun Choi, Jin Ha Lee, C. Willis, J. S. Downie

The assignment of subject metadata to music is useful for organizing and accessing digital music collections. Since manual subject annotation of large-scale music collections is labor-intensive, automatic methods are preferred. Topic modeling algorithms can be used to automatically identify latent topics from appropriate text sources. Candidate text sources such as song lyrics are often too poetic, resulting in lower-quality topics. Users' interpretations of song lyrics provide an alternative source. In this paper, we propose an automatic topic discovery system from web-mined user-generated interpretations of songs to provide subject access to a music digital library. We also propose and evaluate filtering techniques to identify high-quality topics. In our experiments, we use 24,436 popular songs that exist in both the Million Song Dataset and songmeanings.com. Topic models are generated using Latent Dirichlet Allocation (LDA). To evaluate the coherence of learned topics, we calculate the Normalized Pointwise Mutual Information (NPMI) of the top ten words in each topic based on occurrences in Wikipedia. Finally, we evaluate the resulting topics using a subset of 422 songs that have been manually assigned to six subjects. Using this system, 71% of the manually assigned subjects were correctly identified. These results demonstrate that topic modeling of song interpretations is a promising method for subject metadata enrichment in music digital libraries. It also has implications for affording similar access to collections of poetry and fiction.
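The coherence measure mentioned above can be sketched with document-level co-occurrence counts. The NPMI formula is standard, but the rest is an assumption made for illustration: the paper's counting window over Wikipedia is not given in the abstract, so whole documents (as sets of words) are used here, and the function names and the handling of degenerate cases are hypothetical.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, documents):
    """Normalized pointwise mutual information of two words.

    documents is a list of sets of words. NPMI = PMI / -log p(a, b),
    ranging from -1 (never co-occur) to 1 (always co-occur).
    """
    n = len(documents)
    count_a = sum(1 for doc in documents if word_a in doc)
    count_b = sum(1 for doc in documents if word_b in doc)
    count_ab = sum(1 for doc in documents if word_a in doc and word_b in doc)
    if count_ab == 0:
        return -1.0  # conventional value when the words never co-occur
    if count_ab == n:
        return 0.0   # formula is 0/0 here; 0.0 is this sketch's choice
    p_a, p_b, p_ab = count_a / n, count_b / n, count_ab / n
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


def topic_coherence(top_words, documents):
    """Mean NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    if not pairs:
        return 0.0
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)
```

Topics whose top words rarely appear together in the reference corpus score low, which is what makes a measure like this usable for filtering out low-quality topics.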

Citations: 8
