Inter-document reference detection as an alternative to full text semantic analysis in document clustering

IEEE International Workshop on Machine Learning for Signal Processing : [proceedings]. Pub Date: 2013-01-01. Pages: 1-6. DOI: 10.1109/MLSP.2013.6661952

P. D. Mazière, M. Hulle



Abstract

We investigate the detection of inter-document references as an alternative to grouping a document inventory by full-text semantic analysis. The document inventory, which is not publicly available, was provided to us by the European Union (EU) within an EU project whose aim was to analyse, classify, and visualise EU-funded research in the social sciences and humanities under the EU framework programmes FP5 and FP6. This project, called the SSH project for short, was intended to evaluate the contribution of research to the development of EU policies. For the semantics-based grouping, we start from a Multi-Dimensional Scaling analysis of the document vectors obtained from a prior semantic analysis. As an alternative to semantic analysis, we searched for inter-document references, or direct references, defined as terms that explicitly refer to other documents in the inventory. We show that the reference-based grouping is largely similar to the semantics-based one, but requires considerably less computational effort. In addition, non-experts can make better use of the results, since the references are displayed as graphical web pages with hyperlinks to both the referenced and the referencing document(s), together with the reason for the linkage. Finally, we show that combining a database, which stores the data and the (intermediate) results, with a web server, which visualises the results, offers a powerful platform for analysing the document inventory and for sharing the results with all participants and collaborators involved in a data- and computation-intensive EU project, thereby guaranteeing the consistency of both data and results.
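The abstract does not specify how direct references are detected or how documents are grouped from them, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes each document has a hypothetical list of identifying terms (for example a project acronym), treats a whole-word occurrence of such a term in another document as a direct reference, and groups documents that are connected through references using a union-find structure. The function names, document IDs, and sample texts are all invented for illustration.

```python
import re
from collections import defaultdict


def find_direct_references(documents, identifiers):
    """Return (source, target, term) triples.

    documents:   dict mapping doc_id -> full text
    identifiers: dict mapping doc_id -> list of terms naming that document
                 (hypothetical input; the paper does not describe its format)
    """
    refs = []
    for target, terms in identifiers.items():
        for term in terms:
            # Match the term only as a whole token, ignoring case.
            pattern = re.compile(
                r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE
            )
            for source, text in documents.items():
                if source != target and pattern.search(text):
                    # The matched term is kept as the "reason for the linkage".
                    refs.append((source, target, term))
    return refs


def group_by_references(doc_ids, refs):
    """Group documents linked (directly or transitively) by references.

    Assumption: grouping = connected components of the reference graph.
    """
    parent = {d: d for d in doc_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for source, target, _term in refs:
        parent[find(source)] = find(target)

    groups = defaultdict(set)
    for d in parent:
        groups[find(d)].add(d)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Invented sample inventory, for illustration only.
documents = {
    "D1": "This report builds on the findings of the CITYLIFE project.",
    "D2": "CITYLIFE studied urban mobility. See also project MIGRA-NET.",
    "D3": "MIGRA-NET examined migration networks.",
    "D4": "An unrelated study of language policy.",
}
identifiers = {
    "D1": ["URBANREPORT"],
    "D2": ["CITYLIFE"],
    "D3": ["MIGRA-NET"],
    "D4": ["LANGPOL"],
}

references = find_direct_references(documents, identifiers)
clusters = group_by_references(list(documents), references)
```

Each triple records the referencing document, the referenced document, and the matching term, which is the kind of information a results page with hyperlinks and a stated reason for the linkage would need. A pass of this kind requires only string matching over the inventory, which is consistent with the claim of lower computational effort than a full semantic analysis, although the actual cost depends on the inventory size and the number of identifying terms.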



