Broadcast News Story Clustering via Term and Sentence Matching
Foong Kuin Yow, T. Tan
2013 International Conference on Asian Language Processing, 2013-08-17
DOI: 10.1109/IALP.2013.62 (https://doi.org/10.1109/IALP.2013.62)
In this paper, we propose a rule-based approach that uses term and sentence matching criteria to cluster Malay broadcast news into separate stories. The proposed method does not require users to predefine the number of clusters. Clustering has three main stages: sentence segmentation, indexing, and term and sentence matching. The sentences in the transcription are segmented before indexing. Indexing involves tokenization, stop word removal, stemming, term selection, and term representation. A vector space model (VSM) represents the terms and sentences as vectors. The sentences are then grouped into clusters using term and sentence matching thresholds. The proposed approach achieves significantly higher accuracy than the baseline approaches.
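The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact matching rules, so everything specific here is an assumption: the tiny stop-word list, the omission of stemming and term selection, raw term counts as the vector representation, cosine similarity as the sentence-matching measure, the sequential grouping of consecutive sentences, and the names `index_sentence`, `cosine`, `cluster_sentences`, `term_threshold`, and `sentence_threshold` are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"yang", "di", "dan", "itu", "ini", "ke", "dari"}


def index_sentence(sentence):
    """Tokenize, lowercase, and drop stop words. Stemming is omitted in this sketch."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_sentences(sentences, term_threshold=1, sentence_threshold=0.2):
    """Group consecutive sentences into stories without a preset cluster count.

    Assumed rule: a sentence joins the current story when it shares at least
    `term_threshold` terms with the story and its cosine similarity to the
    story's accumulated vector reaches `sentence_threshold`. Otherwise it
    starts a new story.
    """
    clusters = []
    story_vec = Counter()
    for sentence in sentences:
        vec = index_sentence(sentence)
        shared_terms = len(set(vec) & set(story_vec))
        if (clusters and shared_terms >= term_threshold
                and cosine(vec, story_vec) >= sentence_threshold):
            clusters[-1].append(sentence)
            story_vec.update(vec)  # accumulate term counts for the story
        else:
            clusters.append([sentence])
            story_vec = Counter(vec)
    return clusters


transcript = [
    "Banjir melanda Kelantan hari ini",
    "Mangsa banjir di Kelantan dipindahkan",
    "Harga minyak naik minggu depan",
    "Kenaikan harga minyak membebankan rakyat",
]
stories = cluster_sentences(transcript)
```

With these made-up sentences and thresholds, the first two sentences share the terms "banjir" and "kelantan" and fall into one story, while the third shares no term with that story and opens a second one. The number of stories emerges from the thresholds rather than being supplied by the user, which is the property the abstract highlights.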