
Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE-2010): Latest Publications

Web-based technical term translation pairs mining for patent document translation
Feiliang Ren, Jingbo Zhu, Huizhen Wang
This paper proposes a simple but powerful approach for automatically obtaining technical term translation pairs in the patent domain from the Web. First, several technical terms are used as seed queries and submitted to a search engine. Second, an extraction algorithm extracts candidate key word translation pairs from the returned web pages. Finally, a multi-feature evaluation method selects those translation pairs that are true technical term translation pairs in the patent domain. With this method, we obtain about 8,890,000 key word translation pairs that can be used to translate the technical terms in patent documents. Experimental results show that the precision of these translation pairs is more than 99%, and their coverage of the technical terms in patent documents is more than 84%.
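The pipeline has three stages: seed queries to a search engine, candidate extraction from the returned pages, and multi-feature filtering. A minimal sketch of the filtering stage follows; the features (frequency, co-occurrence, length ratio) and weights are illustrative assumptions, since the abstract does not spell out the paper's actual feature set.

```python
import math

# Minimal sketch of the final multi-feature filtering stage. The features
# and weights below are illustrative assumptions, not the paper's.

def score_pair(freq: int, cooc: float, length_ratio: float) -> float:
    """Combine evidence that a (source, target) candidate is a true pair."""
    return 0.5 * math.log1p(freq) + 0.3 * cooc + 0.2 * length_ratio

def filter_pairs(candidates, threshold=1.0):
    """Keep candidates whose combined feature score clears the threshold."""
    return [(src, tgt) for src, tgt, freq, cooc, ratio in candidates
            if score_pair(freq, cooc, ratio) >= threshold]

if __name__ == "__main__":
    candidates = [
        ("semiconductor wafer", "半导体晶片", 120, 0.9, 0.8),  # strong evidence
        ("patent claim", "随机文本", 2, 0.1, 0.3),             # noisy candidate
    ]
    print(filter_pairs(candidates))  # only the first pair survives
```

The threshold trades precision against coverage, which matches the 99%-precision versus 84%-coverage trade-off the abstract reports.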
Citations: 5
iTree - Automating the construction of the narration tree of Hadiths (Prophetic Traditions)
Aqil M. Azmi, Nawaf Bin Badia
The two fundamental sources of Islamic legislation are the Qur'an and the Hadith. The Hadiths, or Prophetic Traditions, are narrations originating from the sayings and conduct of Prophet Muhammad. Each Hadith starts with a list of the narrators involved in transmitting it, followed by the transmitted text. The Hadith corpus is extremely large and runs into hundreds of volumes. Due to its legislative importance, Hadiths have been carefully scrutinized by Hadith scholars. One way a scholar may grade a Hadith is by its narration chain and the individual narrators in the chain. In this paper we report on a system that automatically generates the transmission chains of a Hadith and graphically displays them. Computationally, this is a challenging problem. The text of a Hadith is in Arabic, a morphologically rich language, and each Hadith has its own peculiar way of listing narrators. Our solution involves parsing and annotating the Hadith text and identifying the narrators' names. We use shallow parsing along with a domain-specific grammar to parse the Hadith content. Experiments on sample Hadiths show our approach to have a very good success rate.
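To make the chain-extraction idea concrete, here is a toy sketch in the spirit of the shallow-parsing step: transmission cue words split the narration prefix into narrator names. The cues and names are hypothetical English stand-ins; the actual system parses Arabic with a domain-specific grammar.

```python
import re

# Toy stand-in for the shallow parsing step: split the narration chain on
# transmission cue words. Real Hadith text is Arabic; the cues and names
# here are illustrative English placeholders.
TRANSMISSION_CUES = r"\b(?:narrated to us|informed us|on the authority of|from)\b"

def extract_chain(narration_prefix: str) -> list[str]:
    """Return the ordered list of narrators found before the transmitted text."""
    parts = re.split(TRANSMISSION_CUES, narration_prefix)
    return [p.strip(" ,;") for p in parts if p.strip(" ,;")]

if __name__ == "__main__":
    prefix = "A narrated to us from B on the authority of C"
    print(extract_chain(prefix))  # ['A', 'B', 'C']
```

Chains extracted this way from many Hadiths could then be merged into the narration tree that iTree displays.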
Citations: 30
Information retrieval by text summarization for an Indian regional language
Jagadish S. Kallimani, K. Srinivasa, B. E. Reddy
Information Extraction is a method for filtering information from large volumes of text. It is a more limited task than full text understanding: in full text understanding, we aspire to represent all the information in a text in an explicit fashion, whereas in Information Extraction we delimit in advance, as part of the task specification, the semantic range of the output. In this paper, a model for summarizing large documents using a novel approach is proposed. The work is extended to an Indian regional language (Kannada), and various analyses of the results are discussed.
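Since the abstract leaves the novel summarization approach unspecified, the sketch below is only a generic frequency-based extractive baseline that makes the task concrete; it is not the paper's model, and it is language-agnostic rather than Kannada-specific.

```python
from collections import Counter

# Generic frequency-based extractive baseline (not the paper's model):
# score each sentence by the corpus frequency of its words, keep the
# top-k, and emit them in their original order.

def extractive_summary(sentences, k=2):
    freq = Counter(w.lower() for s in sentences for w in s.split())
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sum(freq[w.lower()] for w in sentences[i].split()),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

if __name__ == "__main__":
    doc = ["Information extraction filters information from large text volumes.",
           "The weather was pleasant yesterday.",
           "Extraction delimits the semantic range of the output in advance."]
    print(extractive_summary(doc, k=2))
```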
Citations: 23
Computerized electronic nursing staffs' daily records system in the “A” psychiatric hospital: Present situation and future prospects
T. Tanioka, A. Kawamura, Mai Date, K. Osaka, Yuko Yasuhara, M. Kataoka, Yukie Iwasa, Toshihiro Sugiyama, Kazuyuki Matsumoto, Tomoko Kawata, Misako Satou, K. Mifune
At the “A” psychiatric hospital, nurses previously kept paper-based daily nursing records. Aiming at higher-quality nursing management, we introduced an electronic management system for nursing staffs' daily records (ENSDR), interlocked with “Psychoms®”, into this hospital. Introducing this system achieved some good effects, but several problems remain. The purpose of this study is to evaluate the current situation and the challenges brought out by using ENSDR, and to indicate the future direction of its development.
Citations: 4
Chinese base phrases chunking based on latent semi-CRF model
Xiao Sun, Xiaoli Nan
In Chinese natural language processing, recognizing simple, non-recursive base phrases is an important task for applications such as information processing and machine translation. Instead of a rule-based model, we adopt a statistical machine learning method, the newly proposed Latent semi-CRF model, to solve the Chinese base phrase chunking problem. Chinese base phrase chunking can be treated as a sequence labeling problem, which involves predicting a class label for each frame in an unsegmented sequence. Chinese base phrases have sub-structures that cannot be observed in the training data. We propose a latent discriminative model called Latent semi-CRF (Latent Semi Conditional Random Fields), which incorporates the advantages of LDCRF (Latent Dynamic Conditional Random Fields) and semi-CRF: it models the sub-structure of a class sequence and learns the dynamics between class labels when detecting Chinese base phrases. Our results demonstrate that the latent dynamic discriminative model compares favorably to Support Vector Machines, the Maximum Entropy Model, and Conditional Random Fields (including LDCRF and semi-CRF) on Chinese base phrase chunking.
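The segment-level side of the model can be made concrete with a small decoding sketch: a semi-CRF scores whole segments rather than single tokens, so the best chunking is found by dynamic programming over segmentations up to a maximum segment length. The scorer below is a stub, and the paper's latent sub-structure variables and learned features are omitted.

```python
# Semi-CRF-style decoding sketch: find the best segmentation of a token
# sequence when each candidate segment, not each token, receives a score.

def segment_score(tokens, start, end, label):
    """Stub segment scorer; a real model would use learned features."""
    return 1.0 if (label == "NP" and end - start <= 3) else 0.1

def decode(tokens, labels=("NP", "O"), max_len=3):
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i] = best score over tokens[:i]
    best[0] = 0.0
    back = [None] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            for label in labels:
                score = best[start] + segment_score(tokens, start, end, label)
                if score > best[end]:
                    best[end], back[end] = score, (start, label)
    segments, i = [], n
    while i > 0:
        start, label = back[i]
        segments.append((tokens[start:i], label))
        i = start
    return list(reversed(segments))

if __name__ == "__main__":
    print(decode(["这", "本", "书", "很", "好"]))
```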
Citations: 13
Chinese semantic role labeling based on semantic knowledge
Yanqiu Shao, Zhifang Sui, Ning Mao
Most semantic role labeling systems use syntactic analysis results to predict semantic roles. However, some problems cannot be solved well by syntactic features alone. In this paper, lexical semantic features are extracted from semantic dictionaries. Two typical lexical semantic dictionaries are used, TongYiCi CiLin and CSD: CiLin is built on convergent relationships, and CSD is based on syntagmatic relationships. Based on these dictionaries, two labeling models are set up, a CiLin model and a CSD model. In addition, a pure syntactic model and a mixed model are built; the mixed model combines all of the syntactic and semantic features. The experimental results show that applying different levels of lexical semantic knowledge helps exploit inherent attributes of the language and improves the performance of the system.
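A minimal sketch of how such dictionary knowledge can enter a labeling model: the thesaurus category of a word becomes one more feature alongside the syntactic ones. The mini-dictionary and category codes below are hypothetical stand-ins for TongYiCi CiLin entries.

```python
# Sketch of augmenting SRL features with lexical-semantic knowledge: a
# CiLin-style thesaurus lookup adds a semantic-category feature next to
# the syntactic ones. The mini-dictionary and codes are hypothetical.

CILIN_MINI = {
    "医生": "Aa01",  # illustrative person-like category code
    "治疗": "Hd03",  # illustrative action-like category code
}

def srl_features(word: str, pos: str, path_to_predicate: str) -> dict:
    """Build one argument candidate's feature map."""
    feats = {"word": word, "pos": pos, "path": path_to_predicate}  # syntactic
    if word in CILIN_MINI:
        feats["cilin"] = CILIN_MINI[word]  # lexical-semantic feature
    return feats

if __name__ == "__main__":
    print(srl_features("医生", "NN", "NP^VP"))
```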
Citations: 2
A method for generating document summary using field association knowledge and subjectively information
Abdunabi Ubul, E. Atlam, K. Morita, M. Fuketa, J. Aoe
In recent years, with the expansion of the Internet, there has been tremendous growth in the volume of electronic text documents available on the Web, which makes it difficult for users to locate needed information efficiently. To facilitate efficient searching, research on summarizing the general outline of a text document is essential. Moreover, as information from bulletin boards, blogs, and other sources is used as consumer-generated media data, text summarization becomes necessary. In this paper a new method for document summarization using three kinds of attribute information, namely fields, associated terms, and attribute grammars, is presented; this method establishes a formal and efficient generation technology. Experimental results on information from 400 blogs show that the summary accuracy rate, readability, and meaning integrity are 87.5%, 85%, and 86%, respectively.
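One plausible reading of the field-association component, sketched under loud assumptions: field association (FA) terms vote for the document's dominant field, and sentences carrying that field's terms are favored for the summary. The lexicon is hypothetical, and the paper's attribute grammars and subjective information are not modeled here.

```python
from collections import Counter

# Hypothetical FA lexicon: term -> field. The real system draws on a large
# field association knowledge base plus attribute grammars.
FA_TERMS = {"camera": "photography", "lens": "photography", "stock": "finance"}

def tokens(sentence):
    return [w.strip(".,!?").lower() for w in sentence.split()]

def dominant_field(sentences):
    counts = Counter(FA_TERMS[w] for s in sentences for w in tokens(s)
                     if w in FA_TERMS)
    return counts.most_common(1)[0][0] if counts else None

def summarize(sentences):
    """Keep sentences that carry a term from the document's dominant field."""
    field = dominant_field(sentences)
    return [s for s in sentences
            if any(FA_TERMS.get(w) == field for w in tokens(s))]

if __name__ == "__main__":
    doc = ["The new camera has a fast lens.",
           "I had lunch at noon.",
           "The lens is sharp."]
    print(summarize(doc))  # the two photography sentences survive
```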
Citations: 1
A new method for solving context ambiguities using field association knowledge
Li Wang, E. Atlam, M. Fuketa, K. Morita, J. Aoe
In computational linguistics, word sense disambiguation is an open problem that is important in various aspects of natural language processing. However, traditional methods using case frames and semantic primitives are not effective for solving context ambiguities that require information beyond individual sentences. This paper presents a new method for solving context ambiguities using a field association scheme that determines the specified fields by using field association (FA) terms. To resolve context ambiguities, the formal disambiguation algorithm calculates the weight of fields within a scope by controlling that scope over a variable number of sentences. Applying the proposed field association knowledge improves the accuracy of disambiguating context ambiguities by 65%.
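A minimal sketch of the weighting idea, under illustrative assumptions: FA terms inside a controllable window of sentences contribute weight to candidate fields, and the heaviest field resolves the ambiguity. The lexicon, weights, and window size below are mine, not the paper's.

```python
# Sketch of scope-controlled field weighting: sum field evidence from FA
# terms in a window of sentences around the ambiguous word, then choose
# the heaviest field. The lexicon and weights are illustrative.

FA_TERMS = {
    "bank": {"finance": 0.5, "river": 0.5},  # the ambiguous term itself
    "loan": {"finance": 1.0},
    "water": {"river": 1.0},
}

def disambiguate(sentences, target_idx, scope=1):
    lo = max(0, target_idx - scope)
    hi = min(len(sentences), target_idx + scope + 1)
    weights = {}
    for s in sentences[lo:hi]:
        for w in s.lower().split():
            for field, wt in FA_TERMS.get(w.strip(".,"), {}).items():
                weights[field] = weights.get(field, 0.0) + wt
    return max(weights, key=weights.get) if weights else None

if __name__ == "__main__":
    doc = ["The loan was approved quickly.",
           "He went to the bank.",
           "They discussed the loan terms."]
    print(disambiguate(doc, target_idx=1, scope=1))  # 'finance'
```

Widening or narrowing the scope changes how much surrounding evidence is pooled, which is the control the abstract describes.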
Citations: 0
Realization of a high performance bilingual OCR system for Thai-English printed documents
S. Tangwongsan, Buntida Suvacharakulton
This paper presents a high performance bilingual OCR system for printed Thai and English text. Given the complex nature of both the Thai and English scripts, the first stage identifies the language within different zones by using geometric properties for differentiation. The second stage is the character recognition process, in which the technique developed includes a feature extractor and a classifier. In feature extraction, the thinned character image is analyzed and categorized into groups. Next, the classifier performs recognition in two steps: a coarse level, followed by a fine level guided by decision trees. To obtain an even better result, the final stage makes use of dictionary look-up to improve overall accuracy. For verification, the system was tested in a series of experiments on printed documents of 141 pages and over 280,000 characters. The results show that the system obtains an average accuracy of 100% on Thai monolingual documents, 98.18% on English monolingual documents, and 99.85% on bilingual documents. In the final stage, with dictionary look-up, the system yields an improved accuracy of up to 99.98% on bilingual documents, as expected.
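The coarse-then-fine idea can be pictured with a small sketch: a cheap structural test picks a candidate group, and a finer comparison runs only within that group. Everything below (the groups, the two-dimensional feature, the nearest-prototype stand-in for the decision-tree fine level) is a placeholder, not the paper's classifier.

```python
# Coarse-to-fine recognition sketch: group candidates by a cheap structural
# feature first, then run the fine classifier only within that group.

COARSE_GROUPS = {
    "ascender": ["b", "d", "k"],
    "descender": ["g", "p", "q"],
    "x-height": ["a", "c", "e"],
}

# Placeholder prototype vectors (ascender group only, for the demo below).
PROTOTYPES = {"b": [0.9, 0.1], "d": [0.1, 0.9], "k": [0.5, 0.5]}

def coarse_classify(has_ascender: bool, has_descender: bool) -> str:
    if has_ascender:
        return "ascender"
    if has_descender:
        return "descender"
    return "x-height"

def fine_classify(vec, group):
    """Nearest-prototype stand-in for the decision-tree-guided fine level."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(COARSE_GROUPS[group], key=lambda c: dist(PROTOTYPES[c], vec))

if __name__ == "__main__":
    group = coarse_classify(has_ascender=True, has_descender=False)
    print(group, "->", fine_classify([0.8, 0.2], group))  # ascender -> b
```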
Citations: 2
Document expansion using relevant web documents for spoken document retrieval
Ryo Masumura, A. Ito, Yu Uno, Masashi Ito, S. Makino
Recently, automatic indexing of spoken documents using a speech recognizer has attracted attention. However, index generation from an automatic transcription has many problems, because the transcription contains many recognition errors and Out-Of-Vocabulary (OOV) words. To solve this problem, we propose a document expansion method using Web documents. To obtain important keywords that are included in the spoken document but lost through recognition errors, we acquire Web documents relevant to the spoken document. An index of the spoken document is then generated by combining an index generated from the automatic transcription with one generated from the Web documents. We propose a method for retrieving relevant documents, and the experimental results show that the retrieved Web documents contain many OOV words. Next, we propose a method for combining the recognized index and the Web index. The experimental results show that the index generated by document expansion is closer to an index built from the manual transcription than the index generated by the conventional method. Finally, we conducted a spoken document retrieval experiment, and the document-expansion-based index gave better retrieval precision than the conventional indexing method.
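One simple way to realize the combination step, sketched under assumptions: interpolate term weights from the transcript index with weights from the retrieved web documents, so keywords lost to recognition errors can re-enter the index. The interpolation weight is an illustrative choice, not the paper's.

```python
from collections import Counter

# Sketch of the index-combination step: interpolate term weights from the
# ASR transcript with weights from retrieved relevant web documents, so
# keywords lost to recognition errors (including OOV words) can re-enter
# the index. The weight alpha = 0.7 is an illustrative assumption.

def combine_indexes(asr_terms, web_terms, alpha=0.7):
    asr, web = Counter(asr_terms), Counter(web_terms)
    return {t: alpha * asr[t] + (1 - alpha) * web[t]
            for t in set(asr) | set(web)}

if __name__ == "__main__":
    asr = ["speech", "retrieval", "index"]          # recognizer output terms
    web = ["speech", "retrieval", "summarization"]  # recovers a missed term
    print(combine_indexes(asr, web))
```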
Citations: 3