{"title":"文本类型的结构分类器:迈向一种新的文本表示模型","authors":"Alexander Mehler, Peter Geibel, O. Pustylnikov","doi":"10.21248/jlcl.22.2007.95","DOIUrl":null,"url":null,"abstract":"Texts can be distinguished in terms of their content, function, structure or layout (Brinker, 1992; Bateman et al., 2001; Joachims, 2002; Power et al., 2003). These reference points do not open necessarily orthogonal perspectives on text classification. As part of explorative data analysis, text classification aims at automatically dividing sets of textual objects into classes of maximum internal homogeneity and external heterogeneity. This paper deals with classifying texts into text types whose instances serve more or less homogeneous functions. Other than mainstream approaches, which rely on the vector space model (Sebastiani, 2002) or some of its descendants (Baeza-Yates and Ribeiro-Neto, 1999) and, thus, on content-related lexical features, we solely refer to structural dierentiae. That is, we explore patterns of text structure as determinants of class membership. Our starting point are tree-like text representations which induce feature vectors and tree kernels. These kernels are utilized in supervised learning based on cross-validation as a method of model selection (Hastie et al., 2001) by example of a corpus of press communication. For a subset of categories we show that classification can be performed very well by structural dierentia only.","PeriodicalId":346957,"journal":{"name":"LDV Forum","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"32","resultStr":"{\"title\":\"Structural Classifiers of Text Types: Towards a Novel Model of Text Representation\",\"authors\":\"Alexander Mehler, Peter Geibel, O. Pustylnikov\",\"doi\":\"10.21248/jlcl.22.2007.95\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Texts can be distinguished in terms of their content, function, structure or layout (Brinker, 1992; Bateman et al., 2001; Joachims, 2002; Power et al., 2003). These reference points do not open necessarily orthogonal perspectives on text classification. As part of explorative data analysis, text classification aims at automatically dividing sets of textual objects into classes of maximum internal homogeneity and external heterogeneity. This paper deals with classifying texts into text types whose instances serve more or less homogeneous functions. Other than mainstream approaches, which rely on the vector space model (Sebastiani, 2002) or some of its descendants (Baeza-Yates and Ribeiro-Neto, 1999) and, thus, on content-related lexical features, we solely refer to structural dierentiae. That is, we explore patterns of text structure as determinants of class membership. Our starting point are tree-like text representations which induce feature vectors and tree kernels. These kernels are utilized in supervised learning based on cross-validation as a method of model selection (Hastie et al., 2001) by example of a corpus of press communication. 
For a subset of categories we show that classification can be performed very well by structural dierentia only.\",\"PeriodicalId\":346957,\"journal\":{\"name\":\"LDV Forum\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2007-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"32\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"LDV Forum\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21248/jlcl.22.2007.95\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"LDV Forum","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21248/jlcl.22.2007.95","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 32
Abstract
Texts can be distinguished in terms of their content, function, structure or layout (Brinker, 1992; Bateman et al., 2001; Joachims, 2002; Power et al., 2003). These reference points do not necessarily open orthogonal perspectives on text classification. As part of exploratory data analysis, text classification aims at automatically dividing sets of textual objects into classes of maximum internal homogeneity and external heterogeneity. This paper deals with classifying texts into text types whose instances serve more or less homogeneous functions. Unlike mainstream approaches, which rely on the vector space model (Sebastiani, 2002) or some of its descendants (Baeza-Yates and Ribeiro-Neto, 1999) and, thus, on content-related lexical features, we refer solely to structural differentiae. That is, we explore patterns of text structure as determinants of class membership. Our starting point is tree-like text representations which induce feature vectors and tree kernels. These kernels are utilized in supervised learning based on cross-validation as a method of model selection (Hastie et al., 2001), by example of a corpus of press communication. For a subset of categories we show that classification can be performed very well by structural differentiae alone.
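The abstract outlines a pipeline for structural text classification: tree-like document representations induce feature vectors (and tree kernels), which are then fed into supervised learning with cross-validation for model selection. The sketch below is a minimal, illustrative rendering of that pipeline and not the authors' implementation: it substitutes a few simple quantitative tree features (node count, maximum depth, mean branching factor) for the paper's tree kernels, a toy corpus of nested lists for the press-communication corpus, and an off-the-shelf scikit-learn SVM for the paper's learner. All function names, features and data below are assumptions for illustration only.

```python
# Minimal sketch (not the authors' implementation): documents are given as
# nested lists standing in for logical document trees (sections -> paragraphs
# -> sentences). We derive a small structural feature vector per tree and
# evaluate an SVM classifier by k-fold cross-validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score


def tree_features(tree):
    """Map a nested-list document tree to a structural feature vector:
    node count, maximum depth, and mean branching factor (illustrative)."""
    nodes, max_depth, branch_sum, inner = 0, 0, 0, 0
    stack = [(tree, 1)]
    while stack:
        node, depth = stack.pop()
        nodes += 1
        max_depth = max(max_depth, depth)
        if isinstance(node, list):          # inner node of the document tree
            inner += 1
            branch_sum += len(node)
            stack.extend((child, depth + 1) for child in node)
    mean_branching = branch_sum / inner if inner else 0.0
    return [nodes, max_depth, mean_branching]


# Toy corpus: each document is a tree, each label a (hypothetical) text type.
docs = [
    [["s", "s"], ["s"]],            # short, flat structure
    [[["s"], ["s", "s"]], ["s"]],   # deeper nesting
    [["s"], ["s"], ["s"], ["s"]],   # wide, shallow
    [[[["s"]]], ["s"]],             # very deep, narrow
]
labels = [0, 1, 0, 1]

X = np.array([tree_features(d) for d in docs])
y = np.array(labels)

# Model selection by cross-validation, mirroring the general setup described
# in the abstract (here: 2-fold CV on the toy data).
scores = cross_val_score(SVC(kernel="rbf", gamma="scale"), X, y, cv=2)
print("mean CV accuracy:", scores.mean())
```

In this reading, the design choice the abstract emphasizes is that class membership is predicted from structural features alone, without any content-related lexical features; the feature extractor above is the only place where documents are inspected, and it never looks at the wording of the text.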