Web-based technical term translation pairs mining for patent document translation
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587775
Feiliang Ren, Jingbo Zhu, Huizhen Wang
This paper proposes a simple but powerful approach for automatically obtaining technical term translation pairs in the patent domain from the Web. First, several technical terms are used as seed queries and submitted to a search engine. Second, an extraction algorithm extracts candidate key word translation pairs from the returned web pages. Finally, a multi-feature evaluation method selects the translation pairs that are genuine technical term translation pairs in the patent domain. With this method, we obtain about 8,890,000 key word translation pairs that can be used to translate the technical terms in patent documents. Experimental results show that the precision of these translation pairs is more than 99%, and their coverage of the technical terms in patent documents is more than 84%.
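The abstract only names the pipeline stages (seed queries, pair extraction, multi-feature evaluation); the sketch below is a minimal illustration of that general shape, not the authors' implementation. The extraction pattern, the three features, and their weights are assumptions for illustration.

```python
# Sketch of a web-mined translation-pair pipeline: extract (Chinese, English)
# candidates from snippets, then rank them with a few hand-set features.
# Pattern, features, and weights are illustrative assumptions only.
import re
from collections import Counter

def candidate_pairs(snippets):
    """Extract candidates assuming the common snippet pattern 'TERM (English translation)'."""
    pattern = re.compile(r'([\u4e00-\u9fff]{2,10})\s*[((]([A-Za-z][A-Za-z \-]{2,40})[))]')
    pairs = Counter()
    for text in snippets:
        for zh, en in pattern.findall(text):
            pairs[(zh, en.strip().lower())] += 1
    return pairs

def score(pair, freq, max_freq):
    zh, en = pair
    f_freq = freq / max_freq                      # relative web co-occurrence frequency
    # Length correlation: Chinese terms and their translations tend to have related lengths.
    f_len = 1.0 - abs(len(zh) * 2 - len(en)) / max(len(zh) * 2, len(en))
    f_spec = 1.0 if ' ' in en else 0.5            # prefer multi-word (more specific) English sides
    return 0.5 * f_freq + 0.3 * f_len + 0.2 * f_spec

snippets = ['NLPKE 2010: 机器翻译 (machine translation) 专题',
            '专利领域的术语: 机器翻译(machine translation)']
pairs = candidate_pairs(snippets)
top = max(pairs.values())
for p, c in pairs.items():
    print(p, round(score(p, c, top), 3))
```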
{"title":"Web-based technical term translation pairs mining for patent document translation","authors":"Feiliang Ren, Jingbo Zhu, Huizhen Wang","doi":"10.1109/NLPKE.2010.5587775","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587775","url":null,"abstract":"This paper proposes a simple but powerful approach for obtaining technical term translation pairs in patent domain from Web automatically. First, several technical terms are used as seed queries and submitted to search engineering. Secondly, an extraction algorithm is proposed to extract some key word translation pairs from the returned web pages. Finally, a multi-feature based evaluation method is proposed to pick up those translation pairs that are true technical term translation pairs in patent domain. With this method, we obtain about 8,890,000 key word translation pairs which can be used to translate the technical terms in patent documents. And experimental results show that the precision of these translation pairs are more than 99%, and the coverage of these translation pairs for the technical terms in patent documents are more than 84%.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121554019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
iTree - Automating the construction of the narration tree of Hadiths (Prophetic Traditions)
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587810
Aqil M. Azmi, Nawaf Bin Badia
The two fundamental sources of Islamic legislation are the Qur'an and the Hadith. The Hadiths, or Prophetic Traditions, are narrations originating from the sayings and conduct of Prophet Muhammad. Each Hadith starts with a list of the narrators involved in transmitting it, followed by the transmitted text. The Hadith corpus is extremely large and runs into hundreds of volumes. Due to its legislative importance, Hadiths have been carefully scrutinized by Hadith scholars. One way a scholar may grade a Hadith is by its narration chain and the individual narrators in the chain. In this paper we report on a system that automatically generates the transmission chains of a Hadith and graphically displays them. Computationally, this is a challenging problem: the text of a Hadith is in Arabic, a morphologically rich language, and each Hadith has its own peculiar way of listing narrators. Our solution involves parsing and annotating the Hadith text and identifying the narrators' names. We use shallow parsing along with a domain-specific grammar to parse the Hadith content. Experiments on sample Hadiths show that our approach has a very good success rate.
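To make the idea of shallow parsing of a narration chain concrete, here is a minimal sketch that splits an isnad on a handful of transmission markers. The transliterated markers and the toy isnad are simplified assumptions; the paper works on Arabic text with a much richer domain-specific grammar.

```python
# Sketch: recover a narration chain by splitting on transmission phrases.
# Markers are transliterated and deliberately minimal (assumption, not the
# authors' grammar); real Hadith text needs morphological handling.
import re

MARKERS = r"(?:haddathana|akhbarana|'an|qala)"

def narration_chain(isnad):
    """Text between two transmission markers (or before the matn) is taken as a narrator name."""
    parts = re.split(MARKERS, isnad)
    return [p.strip(" ,") for p in parts if p.strip(" ,")]

# Toy transliterated isnad (illustrative, not a real Hadith).
isnad = ("haddathana Abdullah ibn Yusuf, akhbarana Malik, "
         "'an Nafi, 'an Ibn Umar")
chain = narration_chain(isnad)
# Render as tree edges: narrator i heard from narrator i+1.
for teller, source in zip(chain, chain[1:]):
    print(f"{teller} <- {source}")
```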
{"title":"iTree - Automating the construction of the narration tree of Hadiths (Prophetic Traditions)","authors":"Aqil M. Azmi, Nawaf Bin Badia","doi":"10.1109/NLPKE.2010.5587810","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587810","url":null,"abstract":"The two fundamental sources of Islamic legislation are Qur'an and the Hadith. The Hadiths, or Prophetic Traditions, are narrations originating from the sayings and conducts of Prophet Muhammad. Each Hadith starts with a list of narrators involved in transmitting it followed by the transmitted text. The Hadith corpus is extremely huge and runs into hundreds of volumes. Due to its legislative importance, Hadiths have been carefully scrutinized by hadith scholars. One way a scholar may grade a Hadith is by its narration chain and the individual narrators in the chain. In this paper we report on a system that automatically generates the transmission chains of a Hadith and graphically display it. Computationally, this is a challenging problem. The text of Hadith is in Arabic, a morphologically rich language; and each Hadith has its own peculiar way of listing narrators. Our solution involves parsing and annotating the Hadith text and identifying the narrators' names. We use shallow parsing along with a domain specific grammar to parse the Hadith content. Experiments on sample Hadiths show our approach to have a very good success rate.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"120 3‐4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132908081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information retrieval by text summarization for an Indian regional language
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587764
Jagadish S. Kallimani, K. Srinivasa, B. E. Reddy
Information Extraction is a method for filtering information from large volumes of text. It is a more limited task than full text understanding: in full text understanding, we aspire to represent all of the information in a text in an explicit fashion, whereas in Information Extraction we delimit in advance, as part of the specification of the task, the semantic range of the output. In this paper, a model for summarization of large documents using a novel approach is proposed. The work is extended to an Indian regional language (Kannada), and various analyses of the results are discussed.
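The abstract does not detail the summarization model itself, so the following is only a generic extractive baseline of the kind the paper builds on, under the assumption of frequency-based sentence scoring; it does not reflect the paper's handling of Kannada.

```python
# Generic frequency-based extractive summarization baseline (assumption:
# the paper's model is extraction-based; its actual scoring is not given here).
import re
from collections import Counter

def summarize(text, n_sentences=2):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))   # Unicode-aware, so Kannada tokens work too

    def score(s):
        toks = re.findall(r'\w+', s.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Keep original sentence order in the output summary.
    return ' '.join(s for s in sentences if s in ranked)

doc = ("Information extraction filters information from large text. "
       "Full text understanding represents all information explicitly. "
       "Extraction systems delimit the semantic range of the output in advance.")
print(summarize(doc, 2))
```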
{"title":"Information retrieval by text summarization for an Indian regional language","authors":"Jagadish S. Kallimani, K. Srinivasa, B. E. Reddy","doi":"10.1109/NLPKE.2010.5587764","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587764","url":null,"abstract":"The Information Extraction is a method for filtering information from large volumes of text. Information Extraction is a limited task than full text understanding. In full text understanding, we aspire to represent in an explicit fashion about all the information in a text. In contrast, in Information Extraction, we delimit in advance, as part of the specification of the task and the semantic range of the output. In this paper, a model for summarization from large documents using a novel approach has been proposed. Extending the work for an Indian regional language (Kannada) and various analyses of results were discussed.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129165516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computerized electronic nursing staffs' daily records system in the “A” psychiatric hospital: Present situation and future prospects
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587814
T. Tanioka, A. Kawamura, Mai Date, K. Osaka, Yuko Yasuhara, M. Kataoka, Yukie Iwasa, Toshihiro Sugiyama, Kazuyuki Matsumoto, Tomoko Kawata, Misako Satou, K. Mifune
At the “A” psychiatric hospital, nurses previously kept paper-based daily records. To support higher-quality nursing management, we introduced an electronic management system for nursing staffs' daily records (ENSDR), interlocked with “Psychoms®”, into this hospital. Introducing this system achieved some good effects; however, some problems remain. The purpose of this study is to evaluate the current situation and the challenges brought out by using the ENSDR, and to indicate the future direction of its development.
{"title":"Computerized electronic nursing staffs' daily records system in the “A” psychiatric hospital: Present situation and future prospects","authors":"T. Tanioka, A. Kawamura, Mai Date, K. Osaka, Yuko Yasuhara, M. Kataoka, Yukie Iwasa, Toshihiro Sugiyama, Kazuyuki Matsumoto, Tomoko Kawata, Misako Satou, K. Mifune","doi":"10.1109/NLPKE.2010.5587814","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587814","url":null,"abstract":"At the “A” psychiatric hospital, previously nurses used paper-based nursing staffs' daily records. We aimed to manage the higher quality nursing and introduced “electronic management system for nursing staffs' daily records system (ENSDR)” interlocked with “Psychoms ®” into this hospital. Some good effects were achieved by introducing this system. However, some problems have been left in this system. The purpose of this study is to evaluate the current situation and challenges which brought out by using ENSDR, and to indicate the future direction of the development.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127284984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chinese base phrases chunking based on latent semi-CRF model
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587802
Xiao Sun, Xiaoli Nan
In Chinese natural language processing, recognizing simple, non-recursive base phrases is an important task for applications such as information processing and machine translation. Instead of a rule-based model, we adopt a statistical machine learning method, the newly proposed Latent semi-CRF model, to solve the Chinese base phrase chunking problem. Chinese base phrase chunking can be treated as a sequence labeling problem, which involves predicting a class label for each frame in an unsegmented sequence. Chinese base phrases have sub-structures that cannot be observed in the training data. We propose a latent discriminative model called the Latent semi-CRF (Latent Semi Conditional Random Fields), which incorporates the advantages of the LDCRF (Latent Dynamic Conditional Random Fields) and the semi-CRF, modeling the sub-structure of a class sequence and learning the dynamics between class labels, for detecting Chinese base phrases. Our results demonstrate that the latent dynamic discriminative model compares favorably to Support Vector Machines, the Maximum Entropy Model, and Conditional Random Fields (including the LDCRF and the semi-CRF) on Chinese base phrase chunking.
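To illustrate the sequence labeling formulation only, here is a minimal Viterbi decoder over BIO-style tags with given emission and transition scores. It shows a plain first-order decoding step, not the latent or semi-Markov machinery of the Latent semi-CRF; all scores are made-up toy numbers.

```python
# Minimal first-order Viterbi decoding for BIO chunk tags (illustration of
# the sequence-labeling formulation; not the paper's Latent semi-CRF).
TAGS = ["B-NP", "I-NP", "O"]

def viterbi(emissions, transitions):
    """emissions: list of {tag: score}; transitions: {(prev, cur): score}."""
    n = len(emissions)
    best = [{t: emissions[0][t] for t in TAGS}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for cur in TAGS:
            prev_tag, prev_score = max(
                ((p, best[i - 1][p] + transitions.get((p, cur), 0.0)) for p in TAGS),
                key=lambda x: x[1])
            best[i][cur] = prev_score + emissions[i][cur]
            back[i][cur] = prev_tag
    # Trace back the highest-scoring tag sequence.
    last = max(TAGS, key=lambda t: best[-1][t])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy example: 3 tokens, with "I-NP" strongly disfavoured after "O".
emissions = [{"B-NP": 2.0, "I-NP": 0.1, "O": 0.5},
             {"B-NP": 0.2, "I-NP": 1.5, "O": 0.4},
             {"B-NP": 0.3, "I-NP": 0.2, "O": 1.0}]
transitions = {("O", "I-NP"): -5.0, ("B-NP", "I-NP"): 1.0}
print(viterbi(emissions, transitions))   # -> ['B-NP', 'I-NP', 'O']
```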
{"title":"Chinese base phrases chunking based on latent semi-CRF model","authors":"Xiao Sun, Xiaoli Nan","doi":"10.1109/NLPKE.2010.5587802","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587802","url":null,"abstract":"In the fields of Chinese natural language processing, recognizing simple and non-recursive base phrases is an important task for natural language processing applications, such as information processing and machine translation. Instead of rule-based model, we adopt the statistical machine learning method, newly proposed Latent semi-CRF model to solve the Chinese base phrase chunking problem. The Chinese base phrases could be treated as the sequence labeling problem, which involve the prediction of a class label for each frame in an unsegmented sequence. The Chinese base phrases have sub-structures which could not be observed in training data. We propose a latent discriminative model called Latent semi-CRF(Latent Semi Conditional Random Fields), which incorporates the advantages of LDCRF(Latent Dynamic Conditional Random Fields) and semi-CRF that model the sub-structure of a class sequence and learn dynamics between class labels, in detecting the Chinese base phrases. Our results demonstrate that the latent dynamic discriminative model compares favorably to Support Vector Machines, Maximum Entropy Model, and Conditional Random Fields(including LDCRF and semi-CRF) on Chinese base phrases chunking.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133124987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chinese semantic role labeling based on semantic knowledge
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587821
Yanqiu Shao, Zhifang Sui, Ning Mao
Most semantic role labeling systems use syntactic analysis results to predict semantic roles. However, some problems cannot be handled well by syntactic features alone. In this paper, lexical semantic features are extracted from semantic dictionaries. Two typical lexical semantic dictionaries are used, TongYiCi CiLin and CSD: CiLin is built on convergent relationships, and CSD is based on syntagmatic relationships. Based on these two dictionaries, two labeling models are set up, the CiLin model and the CSD model. In addition, a pure syntactic model and a mixed model are built; the mixed model combines all of the syntactic and semantic features. The experimental results show that applying different levels of lexical semantic knowledge helps exploit some inherent attributes of the language and improves the performance of the system.
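The sketch below shows the general shape of the "mixed model" idea: standard syntactic argument features augmented with a semantic-class feature looked up in a CiLin-style thesaurus. The tiny thesaurus fragment and feature names are illustrative stand-ins, not the paper's actual resources or feature set.

```python
# Sketch: augment syntactic SRL features with a thesaurus-derived class.
# CILIN below is a toy stand-in for a CiLin-style resource (assumption).
CILIN = {"警察": "person", "公司": "organization", "昨天": "time"}

def srl_features(candidate):
    """candidate: dict with head word, phrase type, parse path, and position
    relative to the predicate (all taken from a syntactic parse)."""
    return {
        "head": candidate["head"],
        "phrase_type": candidate["phrase_type"],   # syntactic feature
        "path": candidate["path"],                 # syntactic feature
        "position": candidate["position"],         # before/after the predicate
        # Lexical semantic feature: the thesaurus class of the head word.
        "cilin_class": CILIN.get(candidate["head"], "UNK"),
    }

cand = {"head": "警察", "phrase_type": "NP",
        "path": "NP<IP>VP>VV", "position": "before"}
print(srl_features(cand))
# A classifier (e.g. maximum entropy) would be trained on such feature dicts;
# the extra cilin_class feature is what separates a mixed model from a
# purely syntactic one.
```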
{"title":"Chinese semantic role labeling based on semantic knowledge","authors":"Yanqiu Shao, Zhifang Sui, Ning Mao","doi":"10.1109/NLPKE.2010.5587821","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587821","url":null,"abstract":"Most of the semantic role labeling systems use syntactic analysis results to predict semantic roles. However, there are some problems that could not be well-done only by syntactic features. In this paper, lexical semantic features are extracted from some semantic dictionaries. Two typical lexical semantic dictionaries are used, TongYiCi CiLin and CSD. CiLin is built on convergent relationship and CSD is based on syntagmatic relationship. According to both of the dictionaries, two labeling models are set up, CiLin model and CSD model. Also, one pure syntactic model and one mixed model are built. The mixed model combines all of the syntactic and semantic features. The experimental results show that the application of different level of lexical semantic knowledge could help use some language inherent attributes and the knowledge could help to improve the performance of the system.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114382717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A method for generating document summary using field association knowledge and subjectively information
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587853
Abdunabi Ubul, E. Atlam, K. Morita, M. Fuketa, J. Aoe
In recent years, with the expansion of the Internet, there has been tremendous growth in the volume of electronic text documents available on the Web, which makes it difficult for users to locate the information they need efficiently. To facilitate efficient searching for information, research on summarizing the general outline of a text document is essential. Moreover, as information from bulletin boards, blogs, and other sources is being used as consumer-generated media data, text summarization becomes necessary. In this paper, a new method for document summarization using three kinds of attribute information, namely the field, associated terms, and attribute grammars, is presented; this method establishes a formal and efficient generation technology. Experimental results using information from 400 blogs show that the summary accuracy rate, readability, and meaning integrity are 87.5%, 85%, and 86%, respectively.
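As a rough illustration of how field information can drive sentence selection, the sketch below scores sentences by the weight of field association (FA) terms for the document's dominant field. The FA table and weights are toy assumptions; the paper's associated terms and attribute grammars are not modelled here.

```python
# Sketch: pick summary sentences rich in FA terms of the dominant field.
# FA_TERMS is a toy table (assumption), not the paper's FA knowledge base.
import re
from collections import defaultdict

FA_TERMS = {                      # term -> (field, weight)
    "inning": ("baseball", 1.0), "pitcher": ("baseball", 1.0),
    "goal": ("soccer", 1.0), "keeper": ("soccer", 0.8),
}

def dominant_field(text):
    scores = defaultdict(float)
    for tok in re.findall(r"\w+", text.lower()):
        if tok in FA_TERMS:
            field, w = FA_TERMS[tok]
            scores[field] += w
    return max(scores, key=scores.get) if scores else None

def summarize(text, k=1):
    field = dominant_field(text)
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    def score(s):
        return sum(w for tok in re.findall(r"\w+", s.lower())
                   for f, w in [FA_TERMS.get(tok, (None, 0.0))] if f == field)
    return sorted(sents, key=score, reverse=True)[:k]

blog = ("The pitcher struggled in the seventh inning. "
        "Fans argued about dinner plans afterwards.")
print(summarize(blog))
```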
{"title":"A method for generating document summary using field association knowledge and subjectively information","authors":"Abdunabi Ubul, E. Atlam, K. Morita, M. Fuketa, J. Aoe","doi":"10.1109/NLPKE.2010.5587853","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587853","url":null,"abstract":"In the recent years, with the expansion of the Internet there has been tremendous growth in the volume of electronic text documents available information on the Web, which making difficulty for users to locate efficiently needed information. To facilitate efficient searching for information, research to summarize the general outline of a text document is essential. Moreover, as the information from bulletin boards, blogs, and other sources is being used as consumer generated media data, text summarization become necessary. In this paper a new method for document summary using three attribute information called: the field, associated terms, and attribute grammars is presented, this method establish a formal and efficient generation technology. From the experiments results it turns out that the summary accuracy rate, readability, and meaning integrity are 87.5%, 85%, and 86%, respectively using information from 400 blogs.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122098779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new method for solving context ambiguities using field association knowledge
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587858
Li Wang, E. Atlam, M. Fuketa, K. Morita, J. Aoe
In computational linguistics, word sense disambiguation is an open problem and is important in various aspects of natural language processing. However, traditional methods using case frames and semantic primitives are not effective for solving context ambiguities that require information beyond the sentence. This paper presents a new method for solving context ambiguities using a field association scheme that can determine the specified fields by using field association (FA) terms. To solve context ambiguities, the formal disambiguation algorithm calculates the weight of fields within a scope while controlling the scope over a variable number of sentences. The accuracy of disambiguating context ambiguities is improved by 65% by applying the proposed field association knowledge.
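The following is a minimal sketch of weighting fields over a controllable scope of sentences: for an ambiguous word, the field whose FA terms dominate the surrounding window is chosen. The FA table, window size, and the ambiguous word "bank" are assumptions for illustration, not the paper's data.

```python
# Sketch: choose a field for an ambiguous word by counting FA terms in a
# sentence window around it (toy FA table; window size is an assumption).
import re
from collections import defaultdict

FA_TERMS = {"loan": "finance", "deposit": "finance", "interest": "finance",
            "river": "geography", "fishing": "geography", "water": "geography"}

def disambiguate(sentences, target_index, window=1):
    """Weight each field by counting FA terms in sentences within `window`
    of the target sentence, then return the top field."""
    lo, hi = max(0, target_index - window), min(len(sentences), target_index + window + 1)
    weights = defaultdict(int)
    for sent in sentences[lo:hi]:
        for tok in re.findall(r"\w+", sent.lower()):
            if tok in FA_TERMS:
                weights[FA_TERMS[tok]] += 1
    return max(weights, key=weights.get) if weights else "unknown"

sents = ["He walked along the water early in the morning.",
         "He sat down at the bank.",          # ambiguous sentence
         "The fishing was good near the river."]
print(disambiguate(sents, target_index=1, window=1))   # -> geography
```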
{"title":"A new method for solving context ambiguities using field association knowledge","authors":"Li Wang, E. Atlam, M. Fuketa, K. Morita, J. Aoe","doi":"10.1109/NLPKE.2010.5587858","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587858","url":null,"abstract":"In computational linguistics, word sense disambiguation is an open problem and is important in various aspects of natural language processing. However, the traditional methods using case frames and semantic primitives are not effective for solving context ambiguities that require information beyond sentences. This paper presents a new method of solving context ambiguities using a field association scheme that can determine the specified fields by using field association (FA) terms. In order to solve context ambiguities, the formal disambiguation algorithm is calculating the weight of fields in that scope by controlling the scope for a set of variable number of sentences. The accuracy of disambiguating the context ambiguities is improved 65% by applying the proposed field association knowledge.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122567155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Realization of a high performance bilingual OCR system for Thai-English printed documents
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587781
S. Tangwongsan, Buntida Suvacharakulton
This paper presents a high performance bilingual OCR system for printed Thai and English text. Given the complex nature of both the Thai and English scripts, the first stage identifies the language within different zones, using geometric properties for differentiation. The second stage is the character recognition process, in which the technique developed includes a feature extractor and a classifier. In feature extraction, the thinned character image is analyzed and categorized into groups. Next, the classifier performs recognition in two steps: a coarse level, followed by a fine level guided by decision trees. To obtain an even better result, the final stage makes use of dictionary look-up to further improve overall accuracy. For verification, the system was tested in a series of experiments on 141 pages of printed documents containing over 280,000 characters. The results show that the system obtains an average accuracy of 100% on Thai monolingual documents, 98.18% on English monolingual documents, and 99.85% on bilingual documents. With dictionary look-up in the final stage, the system yields a further improvement, reaching an accuracy of 99.98% on bilingual documents, as expected.
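To give a concrete flavour of the dictionary look-up stage only, the sketch below replaces recognized words that are not in the lexicon with the closest in-lexicon word by string similarity. The lexicon and the confusion examples are toy assumptions; the zoning and coarse-to-fine classifier are not modelled here.

```python
# Sketch: dictionary-based post-correction of OCR output (toy lexicon;
# the paper's actual look-up strategy is not described in the abstract).
from difflib import get_close_matches

LEXICON = {"recognition", "character", "document", "bilingual", "system"}

def correct(word, cutoff=0.8):
    """Return the word unchanged if it is in the lexicon, otherwise the
    closest lexicon entry above the similarity cutoff (or the word itself)."""
    if word.lower() in LEXICON:
        return word
    matches = get_close_matches(word.lower(), LEXICON, n=1, cutoff=cutoff)
    return matches[0] if matches else word

ocr_output = ["charactor", "recogniti0n", "document", "sysiem"]
print([correct(w) for w in ocr_output])
# -> ['character', 'recognition', 'document', 'system']
```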
{"title":"Realization of a high performance bilingual OCR system for Thai-English printed documents","authors":"S. Tangwongsan, Buntida Suvacharakulton","doi":"10.1109/NLPKE.2010.5587781","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587781","url":null,"abstract":"This paper presents a high performance bilingual OCR system for printed Thai and English text. With the complex nature of both Thai and English languages, the first stage is to identify languages within different zones by using geometric properties for differentiation. The second stage is the process of character recognition, in which the technique developed includes a feature extractor and a classifier. In the feature extraction, the thinned character image is analyzed and categorized into groups. Next, the classifier will take in two steps of recognition: the coarse level, followed by the fine level with a guide of decision trees. As to obtain an even better result, the final stage attempts to make use of dictionary look-up as to check for accuracy improvement in an overall performance. For verification, the system is tested by a series of experiments with printed documents in 141 pages and over 280,000 characters, the result shows that the system could obtain an accuracy of 100% in Thai monolingual, 98.18% in English monolingual, and 99.85% in bilingual documents on the average. In the final stage with a dictionary look-up, the system could yield a better accuracy of improvement up to 99.98% in bilingual documents as expected.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121132894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Document expansion using relevant web documents for spoken document retrieval
Pub Date: 2010-09-30 | DOI: 10.1109/NLPKE.2010.5587854
Ryo Masumura, A. Ito, Yu Uno, Masashi Ito, S. Makino
Recently, automatic indexing of spoken documents using a speech recognizer has attracted attention. However, generating an index from an automatic transcription is problematic because the transcription contains many recognition errors and Out-Of-Vocabulary (OOV) words. To solve this problem, we propose a document expansion method using Web documents. To recover important keywords that are included in the spoken document but lost through recognition errors, we acquire Web documents relevant to the spoken document. An index of the spoken document is then generated by combining an index generated from the automatic transcription with one generated from the Web documents. We propose a method for retrieving relevant documents, and the experimental results show that the retrieved Web documents contain many of the OOV words. Next, we propose a method for combining the recognized index and the Web index. The experimental results show that the index of the spoken document generated by document expansion is closer to an index built from the manual transcription than the index generated by the conventional method. Finally, we conducted a spoken document retrieval experiment, and the document-expansion-based index gave better retrieval precision than the conventional indexing method.
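As a minimal sketch of combining a transcript-based index with a Web-based one, the code below linearly interpolates term weights from the two sources, so Web-only terms (such as OOV words lost by the recognizer) still enter the expanded index. The interpolation weight and toy texts are assumptions; the paper's exact combination scheme is not given in the abstract.

```python
# Sketch: interpolate term weights from an ASR transcript and from
# retrieved Web documents to build an expanded index (alpha is an assumption).
import re
from collections import Counter

def term_weights(text):
    toks = re.findall(r"\w+", text.lower())
    counts = Counter(toks)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def combine(asr_index, web_index, alpha=0.7):
    terms = set(asr_index) | set(web_index)
    return {t: alpha * asr_index.get(t, 0.0) + (1 - alpha) * web_index.get(t, 0.0)
            for t in terms}

asr = "the lecture covers speech recognition and uh indexing"
web = "spoken document retrieval with out of vocabulary keywords and indexing"
expanded = combine(term_weights(asr), term_weights(web))
for term, w in sorted(expanded.items(), key=lambda x: -x[1])[:5]:
    print(f"{term:12s}{w:.3f}")
```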
{"title":"Document expansion using relevant web documents for spoken document retrieval","authors":"Ryo Masumura, A. Ito, Yu Uno, Masashi Ito, S. Makino","doi":"10.1109/NLPKE.2010.5587854","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587854","url":null,"abstract":"Recently, automatic indexing of a spoken document using a speech recognizer attracts attention. However, index generation from an automatic transcription has many problems because the automatic transcription has many recognition errors and Out-Of-Vocabulary words. To solve this problem, we propose a document expansion method using Web documents. To obtain important keywords which included in the spoken document but lost by recognition errors, we acquire Web documents relevant to the spoken document. Then, an index of the spoken document is generated by combining an index that generated from the automatic transcription and the Web documents. We propose a method for retrieval of relevant documents, and the experimental result shows that the retrieved Web document contained many OOV words. Next, we propose a method for combining the recognized index and the Web index. The experimental result shows that the index of the spoken document generated by the document expansion was closer to an index from the manual transcription than the index generated by the conventional method. Finally, we conducted a spoken document retrieval experiment, and the document-expansion-based index gave better retrieval precision than the conventional indexing method.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"27 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121007971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}