Towards automatic Web genre identification: a corpus-based approach in the domain of academia by example of the Academic's Personal Homepage

Georg Rehm
Published in: Proceedings of the 35th Annual Hawaii International Conference on System Sciences
Publication date: 2002-08-07
DOI: https://doi.org/10.1109/HICSS.2002.994036
Citations: 63

Abstract

We argue for a systematic analysis of one particular, well-structured domain, academic Web pages, with regard to a special class of digital genres: Web genres. For this purpose, we have developed a database-driven system that will ultimately hold more than 3,000,000 HTML documents written in German, which form the empirical basis for our research. We introduce the notion of the Web genre type, which constitutes the basic framework for a certain Web genre, and the notions of compulsory and optional Web genre modules. These modules act as building blocks that together make up the structure characterised by the Web genre type; they also operate as modifiers for the default assignment involved. The analysis of a 200-document sample illustrates our notion of a Web genre hierarchy, into which Web genre types and modules are embedded. The analysis of four different documents of the Web genre Academic's Personal Homepage illustrates not only our approach but also our long-term goal: automatically extracting the contents of Web genre modules in order to build structured XML documents from groups of unstructured HTML documents.
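The relationship between a Web genre type and its compulsory and optional modules, and the long-term goal of emitting structured XML, can be pictured with a small Python sketch. This is an illustration only, not the system described in the paper: the class `WebGenreType`, the function `to_xml`, the module names (`identification`, `contact`, `publications`, `teaching`) and the sample values are all hypothetical, and the extraction of module contents from HTML is assumed to have happened already.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class WebGenreType:
    """A Web genre type: the basic framework for one Web genre (hypothetical model)."""
    name: str
    compulsory: frozenset  # modules every instance is expected to contain
    optional: frozenset = field(default_factory=frozenset)

    def missing_modules(self, found):
        """Return the compulsory modules absent from `found`, sorted for stable output."""
        return sorted(self.compulsory - set(found))

    def conforms(self, found):
        """True if all compulsory modules are present and no unknown module appears."""
        found = set(found)
        return self.compulsory <= found <= (self.compulsory | self.optional)


def to_xml(genre, modules):
    """Serialise already-extracted module contents (name -> text) as an XML string."""
    root = ET.Element("document", {"genre": genre.name})
    for name, text in modules.items():
        kind = "compulsory" if name in genre.compulsory else "optional"
        node = ET.SubElement(root, "module", {"name": name, "kind": kind})
        node.text = text
    return ET.tostring(root, encoding="unicode")


# Hypothetical module inventory; the abstract does not list the actual modules.
HOMEPAGE = WebGenreType(
    name="academics-personal-homepage",
    compulsory=frozenset({"identification", "contact"}),
    optional=frozenset({"publications", "teaching"}),
)

# Placeholder contents standing in for text extracted from one HTML page.
extracted = {
    "identification": "Jane Doe",
    "contact": "jane@example.org",
    "publications": "placeholder publication list",
}

if __name__ == "__main__":
    print(HOMEPAGE.conforms(extracted))
    print(to_xml(HOMEPAGE, extracted))
```

In this sketch, a page lacking a compulsory module fails `conforms`, while optional modules may be present or absent without affecting membership in the genre type.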