Are we waves or are we particles? A new insight into deep semantics in natural language processing
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587805
Svetlana Machova, J. Klecková
This paper presents a conceptually new, empirically based scientific approach to a deeper understanding of human cognition, language acquisition, the modularity of language, and the origin of language itself. The research presented uses an interactive multilingual associative experiment as an attempt to map the Cognitive Semantic Space of the Essential Self (CSSES) and its basic frames in the Czech language, and collects and compares it with the CSSES of the conceptual language view in Czech, Russian, English, and potentially other languages. We attempt to merge cognitive metaphor theory with psycholinguistics and psychoanalysis by applying associative experiment methodology to the Essential Self metaphors. The research has two main goals: the first is to build an Essential Self multilingual WordNet, which serves as a basic lexical resource for Artificial Intelligence and describes the core of human nature; the second is to create a multilingual 3D semantic network.
{"title":"Are we waves or are we particles? A new insight into deep semantics in natural language processing","authors":"Svetlana Machova, J. Klecková","doi":"10.1109/NLPKE.2010.5587805","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587805","url":null,"abstract":"This paper brings conceptually new, empirically based scientific approach to a deeper understanding of human mind cognition, language acquisition, modularity of language and language origin itself. The research presented provides an interactive multilingual associative experiment as an attempt to map the Cognitive Semantic Space: (CSSES) and its basic frames of the Essential Self in the Czech language, collects and compares it to the CSSES of conceptual language view in Czech, Russian, English and potentially in other languages. We attempt to merge cognitive metaphor theory with psycholinguistics and psychoanalysis applying associative experiment methodology on the Essential Self metaphors. The research has two main goals: the first is to build an Essential Self multilingual WordNet, which serves as the basic lexical resource for Artificial Intelligence describes the core of the human nature. The second is to create a multilingual 3D semantic network.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129526111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shui nationality characters stroke shape input method
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587840
Hanyue Yang, Xiaorong Chen
The shapes of Shui nationality characters are similar to those of Oracle bone script and Jinwen (bronze inscriptions). To address the problem of encoding such hieroglyphic characters, a coding method based on stroke shape is proposed for Shui nationality characters. The shapes of 467 Shui characters in the Common Shui Script Dictionary are analyzed, and seven basic strokes that make up the main Shui characters are extracted. Through statistical comparison, 21 stroke shapes are obtained by subdividing the seven basic strokes. A Shui character is coded as an ordered sequence of three strokes taken from the corners of the character according to the coding rules. As a result, users who cannot read Shui characters can input them easily and quickly.
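The abstract describes the coding scheme only at a high level. The sketch below illustrates the general idea of a corner-based, three-stroke code in Python; the stroke inventory, code symbols, and example entries are invented for illustration and do not reproduce the paper's 21 stroke shapes or its actual coding rules.

```python
# Hypothetical sketch of a stroke-shape input method in the spirit of the paper.
# The stroke inventory, code symbols, and dictionary entries are invented.

# Invented code table: stroke-shape category -> single code symbol.
STROKE_CODE = {
    "horizontal": "h", "vertical": "v", "left-falling": "p",
    "right-falling": "n", "dot": "d", "hook": "g", "turning": "z",
}

def encode(corner_strokes):
    """Encode a character by the three strokes taken from its corners."""
    if len(corner_strokes) != 3:
        raise ValueError("expected exactly three corner strokes")
    return "".join(STROKE_CODE[s] for s in corner_strokes)

# Invented lookup table: code -> candidate Shui characters (placeholders).
CODE_BOOK = {
    "hvd": ["<shui-char-001>"],
    "pzg": ["<shui-char-042>", "<shui-char-107>"],
}

def lookup(corner_strokes):
    """Return candidate characters for the strokes the user typed."""
    return CODE_BOOK.get(encode(corner_strokes), [])

if __name__ == "__main__":
    print(lookup(["horizontal", "vertical", "dot"]))   # ['<shui-char-001>']
```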
{"title":"Shui nationality characters stroke shape input method","authors":"Hanyue Yang, Xiaorong Chen","doi":"10.1109/NLPKE.2010.5587840","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587840","url":null,"abstract":"Shape of Shui nationality characters is similar to that of Oracle and Jinwen. In order to work out the problems of how to code hieroglyph, a coding method based on stroke shape for Shui Nationality characters is proposed. The shapes of 467 Shui Nationality characters in the Common Shui Script Dictionary are analyzed, and seven basic strokes are extracted to consist of main Shui characters. Through the statistical comparison, 21 kinds of stroke shape can be got by subdividing the seven basic strokes. A Shui Nationality character is coded by an ordered sequence composed by three strokes taken from the corner of the character according to the coding rules. Finally, the users who can not read the Shui character can input it easily and quickly.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130918935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chinese patent retrieval based on the pragmatic information
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587776
Liping Wu, Song Liu, F. Ren
In this paper, we propose a novel information retrieval approach for Chinese patents based on pragmatic information. Patent retrieval is becoming increasingly important: not only are patents an important resource in many fields, but effective patent retrieval also saves a great deal of time and money for corporations and researchers. However, the precision of existing patent retrieval methods is not very high. Moreover, by analyzing patent documents we found that, beyond their literal meanings, patents carry deeper meanings that can be inferred from them; we call these deeper meanings pragmatic information. We therefore built a patent retrieval system that integrates pragmatic information with classical information retrieval techniques to improve retrieval accuracy. Experiments using the proposed method have been carried out, and the results show that the precision of patent retrieval based on pragmatic information is higher than that of retrieval without it.
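As a rough illustration of combining a classical retrieval score with a separate pragmatic score, the following sketch weights a TF-IDF cosine score against an overlap score over hypothetical pragmatic keywords. The weighting scheme, the pragmatic annotations, and the extraction step are assumptions made for illustration, not the paper's method.

```python
# A minimal sketch: blend a literal TF-IDF score with a "pragmatic" overlap score,
# assuming pragmatic keywords were already extracted per patent (invented here).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

patents = ["a device for charging electric vehicles quickly",
           "a chemical coating that resists corrosion"]
# Hypothetical pragmatic annotations (deeper, inferred meanings) per patent.
pragmatic = [{"fast charging", "battery", "infrastructure"},
             {"durability", "protection"}]

def search(query, query_pragmatics, alpha=0.7):
    vec = TfidfVectorizer()
    doc_matrix = vec.fit_transform(patents)
    literal = cosine_similarity(vec.transform([query]), doc_matrix)[0]
    # Pragmatic score: overlap between query-side and patent-side pragmatic terms.
    prag = [len(query_pragmatics & p) / (len(query_pragmatics) or 1) for p in pragmatic]
    scores = [alpha * l + (1 - alpha) * g for l, g in zip(literal, prag)]
    return sorted(range(len(patents)), key=lambda i: -scores[i])

print(search("electric vehicle charger", {"fast charging", "battery"}))  # [0, 1]
```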
{"title":"Chinese patent retrieval based on the pragmatic information","authors":"Liping Wu, Song Liu, F. Ren","doi":"10.1109/NLPKE.2010.5587776","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587776","url":null,"abstract":"In this paper, we propose a novel information retrieval approach based on the pragmatic information for Chinese patents. At present, patent retrieval is becoming more and more important. Not only because patents are always can an important resource in all kinds of field, but patent retrieval save a great deal of time and funds for corporations and researchers. However, with available methods the precision of retrieval results for patents is not very high. What's more, through analyzed the patent documentations we found that except the literal meanings, there are deeper meanings which can be concluded from the patents. Here we call the deeper meanings as pragmatic information. Therefore we established a patent retrieval system to integrate the pragmatic information with classical information retrieval technique to improve the retrieval accuracy. Some experiments using the proposed method have carried out, and the results show that the precision of patent retrieval based on the pragmatic information is higher than the one without using it.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125546479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Part-of-speech tagging for Chinese unknown words in a domain-specific small corpus using morphological and contextual rules
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587771
Tao-Hsing Chang, Fu-Yuan Hsu, Chia-Hoang Lee, Hahn-Ming Lee
Many studies have tried to find useful information on the Internet by means of meaningful terms or words. The performance of these approaches is often affected by the accuracy of unknown word extraction and POS tagging, and this accuracy is in turn affected by the size of the training corpora and the characteristics of the language. This work proposes and develops a method that concentrates on tagging the POS of Chinese unknown words in a domain of interest, based on the integration of morphological rules, contextual rules, and a statistics-based method. Experimental results indicate that the proposed method can overcome the difficulties caused by small corpora in oriental languages and can accurately tag unknown words with POS in domain-specific small corpora.
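A minimal sketch of the rule-then-statistics idea is given below: an unknown word is tagged by morphological rules first, then contextual rules, then a statistical fallback. The rules, tags, and probabilities are invented examples, not the authors' rule set.

```python
# Illustrative only, not the authors' system: tag an unknown Chinese word by
# morphological rules, then contextual rules, then a statistics-based fallback.
MORPH_RULES = [  # (suffix character, POS)
    ("者", "Na"),   # e.g. nominalizing suffix -> common noun
    ("化", "VHC"),  # e.g. verbalizing suffix
]
CONTEXT_RULES = [  # (POS of previous token, POS guess for unknown word)
    ("DE", "Na"),   # after the particle 的, guess noun
]
STAT_PRIOR = {"Na": 0.6, "VC": 0.3, "VH": 0.1}  # invented corpus statistics

def tag_unknown(word, prev_pos):
    for suffix, pos in MORPH_RULES:
        if word.endswith(suffix):
            return pos
    for prev, pos in CONTEXT_RULES:
        if prev_pos == prev:
            return pos
    return max(STAT_PRIOR, key=STAT_PRIOR.get)  # statistics-based fallback

print(tag_unknown("数字化", prev_pos="Nb"))  # -> 'VHC' via the morphological rule
```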
{"title":"Part-of-speech tagging for Chinese unknown words in a domain-specific small corpus using morphological and contextual rules","authors":"Tao-Hsing Chang, Fu-Yuan Hsu, Chia-Hoang Lee, Hahn-Ming Lee","doi":"10.1109/NLPKE.2010.5587771","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587771","url":null,"abstract":"Many studies have tried to search useful information on the Internet by meaningful terms or words. The performance of these approaches is often affected by the accuracy of unknown word extraction and POS tagging, while the accuracy is affected by the size of training corpora and the characteristics of language. This work proposes and develops a method that concentrates on tagging the POS of Chinese unknown words for the domain of our interest, based on the integration of morphological, contextual rules and a statistics-based method. Experimental results indicate that the proposed method can overcome the difficulties resulting from small corpora in oriental languages, and can accurately tags unknown words with POS in domain-specific small corpora.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125283724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Statistical parsing based on Maximal Noun Phrase pre-processing
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587850
Qiaoli Zhou, Yue Gu, Xin Liu, Wenjing Lang, Dongfeng Cai
According to the characteristics of the Chinese language, this paper proposes a statistical parsing method based on Maximal Noun Phrase (MNP) pre-processing. It is preferable to separate MNP parsing from parsing of the full sentence. First, the MNPs in a sentence are identified; next, each MNP is represented by its head word, and the sentence is parsed with the MNPs replaced by their heads. The original sentence is thus divided into two parts that can be parsed separately: the first part is the parsing of the MNPs themselves; the second part is the parsing of the sentence in which the MNPs are replaced by their head words. Finally, the paper uses Conditional Random Fields (CRFs) as the statistical recognition model at each level of the syntactic parsing process.
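The divide-and-conquer pipeline can be sketched as follows, with stand-in components: the MNP recognizer, the head indices, and the parsers are placeholders rather than the paper's CRF-based models.

```python
# Schematic sketch of MNP pre-processing for parsing; all components are stand-ins.

def identify_mnps(tokens):
    """Stand-in MNP recognizer: returns (start, end, head_index) spans."""
    # Pretend tokens[0:4] form one maximal noun phrase headed by tokens[3].
    return [(0, 4, 3)]

def parse(tokens):
    """Stand-in parser; a real system would return a tree, not a string."""
    return "(S " + " ".join(tokens) + ")"

def parse_with_mnp_preprocessing(tokens):
    spans = identify_mnps(tokens)
    reduced, mnp_trees = [], []
    last = 0
    for start, end, head in spans:
        reduced.extend(tokens[last:start])
        reduced.append(tokens[head])                 # 1) replace MNP with its head
        mnp_trees.append(parse(tokens[start:end]))   # 2) parse each MNP separately
        last = end
    reduced.extend(tokens[last:])
    skeleton = parse(reduced)                        # 3) parse the reduced sentence
    return skeleton, mnp_trees

print(parse_with_mnp_preprocessing(["那位", "著名", "的", "科学家", "获奖", "了"]))
```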
{"title":"Statistical parsing based on Maximal Noun Phrase pre-processing","authors":"Qiaoli Zhou, Yue Gu, Xin Liu, Wenjing Lang, Dongfeng Cai","doi":"10.1109/NLPKE.2010.5587850","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587850","url":null,"abstract":"According to the characteristics of Chinese language, this paper proposes a statistical parsing method based on Maximal Noun Phrase(MNP) per-processing. MNP parsing is preferable to be separated from parsing of the full sentence. Firstly, MNP in a sentence are identified; next, MNP can be represented by the head of MNP, and then the sentence is parsed with the head of the MNP. Therefore, the original sentence is divided into two parts, which can be parsed separately. The first part is MNP parsing; the second part is parsing of the sentence in which the MNP are replaced by their head words. Finally, the paper takes Conditional Random Fields (CRFs) as the statistical recognition model of each level in syntactic parsing process.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127018013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A reranking method for syntactic parsing with heterogeneous treebanks
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587842
Haibo Ding, Muhua Zhu, Jingbo Zhu
In the field of natural language processing (NLP), there often exist multiple corpora with different annotation standards for the same task. In this paper, we take syntactic parsing as a case study and propose a reranking method that is able to make direct use of disparate treebanks simultaneously, without resorting to techniques such as treebank conversion. The method proceeds in three steps: 1) build parsers on the individual treebanks; 2) use the parsers independently to generate n-best lists for each sentence in the test set; 3) rerank the individual n-best lists that correspond to the same sentence by using consensus information exchanged among these n-best lists. Experimental results on two open Chinese treebanks show that our method significantly outperforms the baseline systems by 0.84% and 0.53%, respectively.
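The three-step procedure can be illustrated with a toy consensus reranker, assuming each candidate parse is represented as a set of labeled spans with a base model score; the actual features and scoring in the paper are richer.

```python
# Toy consensus-based reranking over n-best lists from different parsers.
# Candidates are (base_score, set_of_labeled_spans) pairs; all values invented.

def consensus_rerank(nbest_lists, weight=1.0):
    """nbest_lists: list over parsers, each a list of (score, constituent_set)."""
    best = None
    for i, nbest in enumerate(nbest_lists):
        others = [c for j, lst in enumerate(nbest_lists) if j != i for c in lst]
        for score, constituents in nbest:
            # Consensus: average span overlap with candidates from the other parsers.
            overlap = sum(len(constituents & c) for _, c in others) / max(len(others), 1)
            total = score + weight * overlap
            if best is None or total > best[0]:
                best = (total, constituents)
    return best

list_a = [(-10.2, {("NP", 0, 2), ("VP", 2, 5)}), (-10.9, {("NP", 0, 1), ("VP", 1, 5)})]
list_b = [(-9.8, {("NP", 0, 2), ("VP", 2, 5)})]
print(consensus_rerank([list_a, list_b]))   # picks the candidate the lists agree on
```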
{"title":"A reranking method for syntactic parsing with heterogeneous treebanks","authors":"Haibo Ding, Muhua Zhu, Jingbo Zhu","doi":"10.1109/NLPKE.2010.5587842","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587842","url":null,"abstract":"In the field of natural language processing (NLP), there often exist multiple corpora with different annotation standards for the same task. In this paper, we take syntactic parsing as a case study and propose a reranking method which is able to make direct use of disparate treebanks simultaneously without using techniques such as treebank conversion. The method proceeds in three steps: 1) build parsers on individual treebanks; 2) use parsers independently to generate n-best lists for each sentence in test set; 3) rerank individual n-best lists which correspond to the same sentence by using consensus information exchanged among these n-best lists. Experimental results on two open Chinese treebanks show that our method significantly outperforms the baseline system by 0.84% and 0.53% respectively.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123574465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Flexible English writing support based on negative-positive conversion method
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587778
Yasushi Katsura, Kazuyuki Matsumoto, F. Ren
With recent globalization, opportunities to communicate in English have increased in the business field. In particular, it is often necessary to write theses and formal documents in English. Because many Japanese speakers are not used to composing English sentences, writing appropriate English without any support is a great burden. In this study we have developed an English composition support system. The system searches a database for an interlinear translation example to refer to and generates a new sentence by replacing a noun in the example sentence. In this paper, based on the Super-Function technique, we propose a method to convert an affirmative sentence into a negative sentence and vice versa, in order to realize more flexible and extensive text conversion.
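The affirmative-to-negative direction can be illustrated with a toy rule set for simple English sentences; this is only a sketch of the kind of conversion involved, not the Super-Function based method itself.

```python
# Toy affirmative -> negative conversion for simple English sentences.
import re

AUX = {"is", "are", "was", "were", "can", "will", "should", "must", "may"}

def to_negative(sentence):
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words):
        if w.lower() in AUX:                       # negate the auxiliary directly
            return " ".join(words[:i + 1] + ["not"] + words[i + 1:]) + "."
    # Otherwise apply do-support to the (assumed) main verb in second position.
    subj, verb, rest = words[0], words[1], words[2:]
    base = re.sub(r"s$", "", verb)                 # crude 3rd-person-singular stripping
    return " ".join([subj, "does not", base] + rest) + "."

print(to_negative("The system is available."))     # The system is not available.
print(to_negative("She writes a report."))         # She does not write a report.
```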
{"title":"Flexible English writing support based on negative-positive conversion method","authors":"Yasushi Katsura, Kazuyuki Matsumoto, F. Ren","doi":"10.1109/NLPKE.2010.5587778","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587778","url":null,"abstract":"With development of the recent globalization, the chance to exchange in English increased in the business field. In particular, it's necessary to write a thesis and a charter handwriting in English. Because many Japanese are not used to making English sentence, it is a great burden to write appropriate sentence in English without any support for creating English sentence. In this study we have developed an English composition support system. By this system, it's to search for the interlinear translation example to refer to by database and generate a new sentence by replacing a noun in the example sentence. In this paper, based on the technique of Super-Function, we propose a method to convert an affirmative sentence into negative sentence and vice versa to realize more flexible and extensive text conversion.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121698126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context-based term identification and extraction for ontology construction
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587801
Hui-Ngo Goh, Ching Kiu
Ontology construction often requires a domain-specific corpus for conceptualizing the domain knowledge; specifically, an ontology is an association of terms, relations between terms, and related instances. Identifying a list of significant terms is a vital task in constructing a practical ontology. In this paper, we present a context-based term identification and extraction methodology for ontology construction from text documents. The methodology uses a taxonomy and Wikipedia to support automatic term identification and extraction from structured documents, under the assumption that candidate terms for a topic are often associated with its topic-specific keywords. A hierarchical relationship of super-topics and sub-topics is defined by the taxonomy, while Wikipedia provides context and background knowledge for the topics defined in the taxonomy to guide term identification and extraction. The experimental results show that the context-based term identification and extraction methodology is viable for defining topic concepts and their sub-concepts when constructing an ontology, and that it can be applied in small corpus / text size environments to support ontology construction.
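A schematic sketch of the underlying idea follows: a candidate term is accepted for a topic when it co-occurs in a sentence with keywords drawn from the topic's background context. The taxonomy, the context text standing in for a Wikipedia article, and the thresholds are all invented for illustration.

```python
# Schematic sketch: accept candidate terms that co-occur with topic keywords.
import re
from collections import Counter

taxonomy = {"Computing": ["Machine learning"]}     # super-topic -> sub-topics (invented)
context = {"Machine learning": "Machine learning studies algorithms that learn "
                               "models from data for prediction and classification."}

def topic_keywords(topic, k=5):
    words = re.findall(r"[a-z]+", context[topic].lower())
    stop = {"that", "from", "for", "and", "the"}
    return {w for w, _ in Counter(w for w in words if w not in stop).most_common(k)}

def extract_terms(topic, document):
    keys = topic_keywords(topic)
    terms = set()
    for sent in re.split(r"[.!?]", document.lower()):
        tokens = set(re.findall(r"[a-z]+", sent))
        if tokens & keys:                          # sentence mentions topic keywords
            terms |= {t for t in tokens if len(t) > 3 and t not in keys}
    return terms

doc = "Neural networks learn models from data. The cafeteria menu changed."
print(extract_terms("Machine learning", doc))  # candidates come only from the first sentence
```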
{"title":"Context-based term identification and extraction for ontology construction","authors":"Hui-Ngo Goh, Ching Kiu","doi":"10.1109/NLPKE.2010.5587801","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587801","url":null,"abstract":"Ontology construction often requires a domain specific corpus in conceptualizing the domain knowledge; specifically, it is an association of terms, relation between terms and related instances. It is a vital task to identify a list of significant term for constructing a practical ontology. In this paper, we present the use of a context-based term identification and extraction methodology for ontology construction from text document. The methodology is using a taxonomy and Wikipedia to support automatic term identification and extraction from structured documents with an assumption of candidate terms for a topic are often associated with its topic-specific keywords. A hierarchical relationship of super-topics and sub-topics is defined by a taxonomy, meanwhile, Wikipedia is used to provide context and background knowledge for topics that defined in the taxonomy to guide the term identification and extraction. The experimental results have shown the context-based term identification and extraction methodology is viable in defining topic concepts and its sub-concepts for constructing ontology. The experimental results have also proven its viability to be applied in a small corpus / text size environment in supporting ontology construction.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126280720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A pragmatic model for new Chinese word extraction
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587846
Haijun Zhang, Heyan Huang, Chao-Yong Zhu, Shumin Shi
This paper proposes a pragmatic model for repeat-based Chinese New Word Extraction (NWE). It contains two innovations. The first is a formal description of the NWE process, which provides theoretical guidance for feature selection; on this basis, the Conditional Random Fields (CRF) model is chosen as the statistical framework for solving the formal description. The second is an improved algorithm for left (right) entropy that improves the efficiency of NWE. Compared with the baseline algorithm, the improved algorithm speeds up the entropy computation remarkably. Overall, experiments show that the proposed model is very effective, with F-scores of 49.72% in the open test and 69.83% in word extraction respectively, an evident improvement over previous similar work.
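One of the statistics involved, the left/right (branching) entropy of a candidate string, can be computed as in the sketch below; this is the textbook definition rather than the paper's improved algorithm, and the tiny corpus is invented.

```python
# Left/right (branching) entropy of a candidate new word over a toy corpus.
import math
from collections import Counter

def side_entropy(corpus, candidate, side="left"):
    """Entropy of the character distribution adjacent to `candidate`."""
    neighbours = Counter()
    start = corpus.find(candidate)
    while start != -1:
        pos = start - 1 if side == "left" else start + len(candidate)
        if 0 <= pos < len(corpus):
            neighbours[corpus[pos]] += 1
        start = corpus.find(candidate, start + 1)
    total = sum(neighbours.values())
    return -sum((c / total) * math.log2(c / total)
                for c in neighbours.values()) if total else 0.0

corpus = "小明喜欢微博，小红也发微博，微博上有很多新词"
print(side_entropy(corpus, "微博", "left"), side_entropy(corpus, "微博", "right"))
```

A high entropy on both sides indicates that the candidate string is used in varied contexts, which is evidence that it is a word rather than a fragment of a longer expression.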
{"title":"A pragmatic model for new Chinese word extraction","authors":"Haijun Zhang, Heyan Huang, Chao-Yong Zhu, Shumin Shi","doi":"10.1109/NLPKE.2010.5587846","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587846","url":null,"abstract":"This paper proposed a pragmatic model for repeat-based Chinese New Word Extraction (NWE). It contains two innovations. The first is a formal description for the process of NWE, which gives instructions on feature selection in theory. On the basis of this, the Conditional Random Fields model (CRF) is selected as statistical framework to solve the formal description. The second is an improved algorithm for left (right) entropy to improve the efficiency of NWE. By comparing with baseline algorithm, the improved algorithm can enhance the computational speed of entropy remarkably. On the whole, experiments show that the model this paper proposed is very effective, and the F score is 49.72% in open test and 69.83% in word extraction respectively, which is an evident improvement over previous similar works.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122616272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bagging to find better expansion words
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587826
Bingqing Wang, Yaqian Zhou, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Supervised learning has been applied to query expansion, where a model is trained to predict the "goodness" or "utility" of an expansion term for the retrieval system. There are many features that measure the relatedness between an expansion word and the query, and these can be incorporated into supervised learning to select expansion terms. The training data set is generated automatically by a tricky heuristic; however, this generation can be affected by many factors. A severe problem is that the distribution of the features is query-dependent, which has not been discussed in previous work. With feature distributions that differ across queries, it is questionable to merge these training instances together and use the whole data set to train one single model. In this paper, we first investigate the statistical distribution of the auto-generated training data and expose the problems in the training data set. Based on our analysis, we propose to use bagging to ensemble several regression models in order to obtain a better supervised model for predicting the utility of expansion terms. We conducted experiments on the TREC benchmark test collections. Our analysis of the training data reveals some interesting phenomena about query expansion techniques, and the experimental results show that the bagging approach achieves state-of-the-art retrieval performance on the standard TREC data set.
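A minimal sketch of the bagging idea using scikit-learn is shown below: an ensemble of regressors is trained to map feature vectors of candidate expansion terms to utility scores, and candidates are ranked by the averaged prediction. The features and training scores are synthetic placeholders, not derived from TREC data.

```python
# Bagging an ensemble of regressors to score candidate expansion terms.
import numpy as np
from sklearn.ensemble import BaggingRegressor

# Synthetic training set: each row = features of one (query, expansion term) pair,
# e.g. co-occurrence with query terms, term IDF, distribution similarity, ...
X_train = np.random.rand(200, 4)
y_train = X_train @ np.array([0.5, 0.3, 0.1, 0.1]) + 0.05 * np.random.randn(200)

model = BaggingRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)

# Hypothetical candidate expansion terms with their (invented) feature vectors.
candidates = {"jaguar": [0.8, 0.6, 0.4, 0.7], "car": [0.7, 0.5, 0.9, 0.2],
              "zoo": [0.1, 0.2, 0.3, 0.1]}
scores = model.predict(np.array(list(candidates.values())))
ranked = sorted(zip(candidates, scores), key=lambda p: -p[1])
print(ranked)          # expansion terms ordered by predicted utility
```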
{"title":"Bagging to find better expansion words","authors":"Bingqing Wang, Yaqian Zhou, Xipeng Qiu, Qi Zhang, Xuanjing Huang","doi":"10.1109/NLPKE.2010.5587826","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587826","url":null,"abstract":"The supervised learning has been applied into the query expansion techniques, which trains a model to predict the “goodness” or “utility” of the expanded term to the retrieval system. There are many features to measure the relatedness between the expanded word and the query, which can be incorporated in the supervised learning to select the expanded terms. The training data set is generated automatically by a tricky method. However, this method can be affected by many aspects. A severe problem is that the distribution of the features is query-dependent, which has not been discussed in previous work. With a different distribution on the features, it is questionable to merge these training instances together and use the whole data set to train one single model. In this paper, we first investigate the statistical distribution of the auto-generated training data and show the problems in the training data set. Based on our analysis, we proposed to use the bagging method to ensemble several regression models in order to get a better supervised model to make prediction on the expanded terms. We conducted the experiments on the TREC benchmark test collections. Our analysis on the training data reveals some interesting phenomena about the query expansion techniques. The experiment results also show that the bagging approach can achieve the state-of-art retrieval performance on the standard TREC data set.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125257444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}