False hits of tri-syllabic queries in a Chinese signature file
Tyne Liang, Suh-Yin Lee, Wei-Pang Yang
Proceedings of 3rd International Conference on Document Analysis and Recognition, 14 August 1995. https://doi.org/10.1109/ICDAR.1995.598966
When superimposed coding is applied to character-based Chinese text retrieval, a multi-syllabic (multi-character) query can produce two kinds of false hits. The first is a random false hit (RFH), caused by irrelevant characters in a document accidentally setting the query's bits in the document signature. The second is an adjacency false hit (AFH), caused by the loss of character-sequence information when the signature is created. Because many query terms are proper nouns and Chinese names, which often contain three characters (tri-syllabic), we derive a formula to estimate the RFH for tri-syllabic queries. The AFH cannot be reduced by single-character (monogram) hashing, so we design a method that hashes consecutive character pairs (bigrams) to reduce both the AFH and the RFH. We find that a combined scheme, which encodes both monogram and bigram keys in document signatures, has an optimal weight assignment that minimizes the false hit rate.
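The difference between monogram and bigram hashing can be illustrated with a minimal Python sketch. It is not the authors' implementation: the signature width `WIDTH`, the SHA-256-based bit selection, the per-key weights, and all function names are assumptions chosen for illustration. The sketch builds a superimposed signature by OR-ing the bit patterns of every key in a text, and tests a query by checking that all of its bits are set in the document signature.

```python
import hashlib

WIDTH = 1024  # signature length in bits (assumed value, not from the paper)


def key_bits(key: str, weight: int) -> int:
    """Return a bit mask with up to `weight` bits set for one key."""
    mask = 0
    for i in range(weight):
        digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:4], "big") % WIDTH)
    return mask


def monograms(text: str) -> list[str]:
    """Single characters of the text."""
    return list(text)


def bigrams(text: str) -> list[str]:
    """Consecutive character pairs of the text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def signature(text: str, mono_weight: int, bi_weight: int) -> int:
    """Superimpose (bitwise OR) the patterns of all monogram and bigram keys.

    A weight of 0 switches the corresponding key type off.
    """
    sig = 0
    for key in monograms(text):
        sig |= key_bits(key, mono_weight)
    for key in bigrams(text):
        sig |= key_bits(key, bi_weight)
    return sig


def may_contain(doc_sig: int, query_sig: int) -> bool:
    """Signature test: every bit of the query must be set in the document."""
    return (doc_sig & query_sig) == query_sig


query = "王小明"            # a tri-syllabic name
doc_hit = "昨天王小明来了"   # really contains the name
doc_adj = "小王明天来"       # same three characters, different order


def matches(doc: str, mono_weight: int, bi_weight: int) -> bool:
    return may_contain(signature(doc, mono_weight, bi_weight),
                       signature(query, mono_weight, bi_weight))


print("monogram only:", matches(doc_hit, 3, 0), matches(doc_adj, 3, 0))
print("bigram only:  ", matches(doc_hit, 0, 3), matches(doc_adj, 0, 3))
print("combined:     ", matches(doc_hit, 2, 2), matches(doc_adj, 2, 2))
```

With monogram keys alone, `doc_adj` necessarily passes the signature test because it contains all three query characters, which is an adjacency false hit. The bigram keys `王小` and `小明` do not occur in `doc_adj`, so the bigram and combined schemes reject it unless their bits happen to be set by other keys, which would be a random false hit. How to split a fixed bit budget between `mono_weight` and `bi_weight` is the weight-assignment question the abstract refers to; the values used here are arbitrary.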