MLPV: Text Representation of Scientific Papers Based on Structural Information and Doc2vec

Yonghe Lu, Yuanyuan Zhai, Jiayi Luo, Yongshan Chen
DOI: 10.11648/J.AJIST.20190303.12 (https://doi.org/10.11648/J.AJIST.20190303.12)
Journal: Journal of the American Society for Information Science and Technology, volume 30 9
Published: 2019-08-28
Publication type: Journal Article
Citations: 4

Abstract

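Text representation is central to text processing. Scientific papers have pronounced structural features: their internal components (chiefly the title, abstract, keywords, and main text) differ in importance, and external structural features such as topic and authors are also valuable for analysis. However, most traditional methods for analyzing scientific papers rely on keyword co-occurrence and citation links, which capture only part of the available information. Textual information and external structural information have received little study, which has limited deeper exploration of the regularities underlying scientific papers. This paper therefore proposes Multi-Layers Paragraph Vector (MLPV), a text representation method for scientific papers based on Doc2vec and on both the internal and external structural information of the papers, and constructs five text representation models: PV-NO, PV-TOP, PV-TAKM, MLPV, and MLPV-PSO. The results show that MLPV performs much better than PV-NO, PV-TOP, and PV-TAKM. Its average accuracy is both higher and more stable, reaching 91.71%, which demonstrates its validity. The optimized MLPV-PSO model is 3.33% more accurate than MLPV, which demonstrates the effectiveness of the optimization algorithm.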
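
The idea that different components of a paper carry different importance can be illustrated with a small sketch. The abstract does not say how MLPV combines its components, so the weighted average below is an assumption made purely for illustration, not the authors' formula. The function name `combine_section_vectors`, the section names, the toy vectors, and the weights are all hypothetical; in practice the per-section vectors would come from a trained Doc2vec model.

```python
from typing import Dict, List

Vector = List[float]


def combine_section_vectors(section_vectors: Dict[str, Vector],
                            weights: Dict[str, float]) -> Vector:
    """Weighted average of per-section vectors.

    Illustrative only: this is an assumed combination rule,
    not the one used in the MLPV paper.
    """
    if not section_vectors:
        raise ValueError("need at least one section vector")
    total = sum(weights[name] for name in section_vectors)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(next(iter(section_vectors.values())))
    combined = [0.0] * dim
    for name, vec in section_vectors.items():
        if len(vec) != dim:
            raise ValueError(
                f"section {name!r} has dimension {len(vec)}, expected {dim}")
        w = weights[name] / total  # normalize so the weights sum to 1
        for i, x in enumerate(vec):
            combined[i] += w * x
    return combined


# Toy 3-dimensional vectors standing in for Doc2vec output (hypothetical values).
sections = {
    "title": [1.0, 0.0, 0.0],
    "abstract": [0.0, 1.0, 0.0],
    "keywords": [0.0, 0.0, 1.0],
    "main_text": [1.0, 1.0, 1.0],
}
# Hypothetical importance weights, chosen only to make the example concrete.
weights = {"title": 0.4, "abstract": 0.3, "keywords": 0.2, "main_text": 0.1}

paper_vector = combine_section_vectors(sections, weights)
print(paper_vector)
```

Normalizing the weights inside the function keeps the result on the same scale as the input vectors, so any set of positive weights can be tried without rescaling.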