Exploring the Potentialities of Automatic Extraction of University Webometric Information

Journal of data and information science (Warsaw, Poland) Pub Date : 2020-11-01 DOI:10.2478/jdis-2020-0040

Gianpiero Bianchi, R. Bruni, C. Daraio, A. Palma, G. Perani, Francesco Scalfati

Volume: 5; pages: 43-55. Publication type: Journal Article. Source platform: Semantic Scholar. URL: https://doi.org/10.2478/jdis-2020-0040

Citations: 3

Abstract

Purpose: The main objective of this work is to show the potential of recently developed approaches for automatic knowledge extraction directly from universities’ websites. The automatically extracted information can potentially be updated more often than once per year, and be safe from manipulation or misinterpretation. Moreover, this approach gives us flexibility in collecting indicators of the efficiency of universities’ websites and of their effectiveness in disseminating key content. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions that allow new insights for “profiling” the analyzed universities.

Design/methodology/approach: Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all three categories of web mining: web content mining, web structure mining, and web usage mining. The information needed to compute our indicators was extracted from the universities’ websites using web scraping and text mining techniques. The scraped information was stored in a NoSQL database in a semi-structured form, so that it can be retrieved efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data were also collected by means of batch queries to a search engine (Bing, www.bing.com) or from a leading provider of web analytics (SimilarWeb, http://www.similarweb.com). The information extracted from the web was combined with structural information on the universities taken from the European Tertiary Education Register (https://eter.joanneum.at/#/home), a database collecting information on Higher Education Institutions (HEIs) at the European level. All of the above was used to cluster 79 Italian universities based on structural and digital indicators.

Findings: The main findings of this study concern the evaluation of universities’ potential for digitalization, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of the quality and impact of universities’ websites. These indicators can complement traditional indicators and can be used, through clustering techniques, to identify groups of universities with common features.

Research limitations: The results reported in this study refer to Italian universities only, but the approach could be extended to other university systems abroad.

Practical implications: The approach proposed in this study, and its illustration on Italian universities, show the usefulness of recently introduced automatic data extraction and web scraping approaches and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems.

Originality/value: This work applies to university websites, for the first time, some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition and nontrivial text mining operations (Bruni & Bianchi, 2020).
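The extraction step described under Design/methodology/approach (scrape a page, keep it in semi-structured form, derive indicators from it) can be illustrated with a minimal Python sketch. The abstract does not name the tools, schema, or indicators actually used, so everything here is an assumption made for illustration: the standard-library `html.parser` stands in for the scraper, a plain dictionary stands in for a record in a document-oriented NoSQL store, and the field names (`text`, `internal_links`, `external_links`), the indicator names, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import json
import re


class PageParser(HTMLParser):
    """Collect visible text and hyperlink targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def page_to_document(url, html):
    """Turn raw HTML into a semi-structured record (hypothetical schema)."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    host = urlparse(url).netloc
    # Resolve relative links and keep only web links (drops mailto:, etc.).
    absolute = [urljoin(url, href) for href in parser.links]
    web_links = [u for u in absolute if urlparse(u).scheme in ("http", "https")]
    return {
        "url": url,
        "text": " ".join(parser.text_parts),  # input for content mining
        "internal_links": [u for u in web_links if urlparse(u).netloc == host],
        "external_links": [u for u in web_links if urlparse(u).netloc != host],
    }


def indicators(doc, keywords):
    """Compute a few illustrative per-page indicators from a stored record."""
    words = re.findall(r"\w+", doc["text"].lower())
    return {
        "word_count": len(words),
        "internal_link_count": len(doc["internal_links"]),
        "external_link_count": len(doc["external_links"]),
        "keyword_hits": {k: words.count(k.lower()) for k in keywords},
    }


# Made-up page used only to exercise the functions above.
SAMPLE_HTML = """
<html><head><style>p {color: red}</style></head>
<body><h1>Research at Example University</h1>
<p>Our research groups publish open research data.</p>
<a href="/departments">Departments</a>
<a href="https://partner.example.org/">Partner</a>
<script>var x = 1;</script></body></html>
"""

doc = page_to_document("https://www.example.edu/research", SAMPLE_HTML)
print(json.dumps(indicators(doc, ["research", "students"]), indent=2))
```

Keeping the raw text and link lists in the record, rather than only the computed numbers, is what would let new indicators be designed later without scraping the site again, which is the flexibility the abstract attributes to the semi-structured storage.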

Source journal

Journal of data and information science (Warsaw, Poland)

Self-citation rate

0.00%

Number of publications