Enriching Cold Start Personalized Language Model Using Social Network Information
Yu-Yang Huang, Rui Yan, Tsung-Ting Kuo, Shou-de Lin
We introduce a generalized framework to enrich personalized language models for cold-start users. The cold-start problem is addressed with content written by friends on social network services. Our framework consists of a mixture language model whose mixture weights are estimated with a factor graph. The factor graph is used to incorporate prior knowledge and heuristics to identify the most appropriate weights. Intrinsic and extrinsic experiments show significant improvement for cold-start users.
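As a rough illustration of the modeling idea (not the authors' implementation), the Python sketch below interpolates unigram language models built from friends' posts; the mixture weights, which the paper estimates with a factor graph, are simply fixed by hand here.

```python
# Minimal sketch of a mixture language model for a cold-start user:
# p(w | user) = sum_f lambda_f * p(w | friend_f).
# The lambda_f weights are hand-picked for illustration only; the paper
# estimates them with a factor graph over prior knowledge and heuristics.
from collections import Counter

def unigram_model(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def mixture_prob(word, friend_models, weights):
    # weights are assumed to be normalized (sum to 1)
    return sum(lam * model.get(word, 0.0) for model, lam in zip(friend_models, weights))

friend_a = unigram_model("going to the game tonight with friends".split())
friend_b = unigram_model("reading a new paper on language models".split())
print(mixture_prob("game", [friend_a, friend_b], [0.7, 0.3]))
```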
{"title":"Enriching Cold Start Personalized Language Model Using Social Network Information","authors":"Yu-Yang Huang, Rui Yan, Tsung-Ting Kuo, Shou-de Lin","doi":"10.3115/v1/P14-2100","DOIUrl":"https://doi.org/10.3115/v1/P14-2100","url":null,"abstract":"We introduce a generalized framework to enrich the personalized language models for cold start users. The cold start problem is solved with content written by friends on social network services. Our framework consists of a mixture language model, whose mixture weights are es- timated with a factor graph. The factor graph is used to incorporate prior knowledge and heuris- tics to identify the most appropriate weights. The intrinsic and extrinsic experiments show significant improvement on cold start users.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"56 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114119622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-12-01 | DOI: 10.30019/IJCLCLP.201212.0002
TQDL: Integrated Models for Cross-Language Document Retrieval
Longyue Wang, Derek F. Wong, Lidia S. Chao
This paper proposes an integrated approach to Cross-Language Information Retrieval (CLIR) that combines four statistical models: a translation model, a query generation model, a document retrieval model, and a length filter model. Given a document in the source language, it is first translated into the target language by a statistical machine translation model. The query generation model then selects the most relevant words in the translated document as a query. Instead of retrieving all target documents with this query, the length-based model filters out a large number of irrelevant candidates according to their length. Finally, the remaining target-language documents are scored by the document retrieval model, which mainly computes the similarity between the query and each document. Unlike traditional parallel-corpus-based models that rely on the IBM alignment algorithms, we divide our CLIR model into four independent parts that work together to handle term disambiguation, query generation, and document retrieval. The TQDL method can efficiently address translation ambiguity and query expansion for disambiguation, which are major issues in CLIR. Another contribution is the length filter, which is trained on a parallel corpus from the ratio of lengths between the two languages. It not only improves recall by dynamically filtering out many useless documents, but also increases efficiency through a smaller search space, so precision can be improved without sacrificing recall. To evaluate the retrieval performance of the proposed model on cross-language document retrieval, a number of experiments were conducted under different settings. The Europarl corpus, a collection of parallel texts in 11 languages from the proceedings of the European Parliament, was used for evaluation. We deliberately tested the models on the difficult case where text lengths are uneven and some documents have similar content under the same topic, which makes them hard to distinguish. After comparing different strategies, the experimental results show that the method performs well: precision is normally above 90% when a larger query size is used, and the length-based filter plays a very important role in improving the F-measure and optimizing efficiency. This demonstrates the discriminative power of the proposed method, which is significant both for cross-language search on the Internet and for producing parallel corpora for statistical machine translation systems. In future work, the TQDL system will be evaluated on Chinese, which presents greater challenges and is more meaningful for CLIR.
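The length-filter component is the easiest part of the pipeline to illustrate. The sketch below is a simplification with invented function names, data, and thresholds rather than the paper's code: it estimates a length ratio from a parallel corpus and discards candidate documents whose lengths are implausible for the given source document.

```python
# Illustrative sketch of the length-filter idea: candidate target-language
# documents are discarded when their length is implausible given the source
# document's length and a length ratio estimated from a parallel corpus.
# All names, data, and the tolerance value are assumptions for illustration.
def estimate_length_ratio(parallel_pairs):
    # parallel_pairs: list of (source_len, target_len) token counts
    ratios = [t / s for s, t in parallel_pairs if s > 0]
    return sum(ratios) / len(ratios)

def length_filter(source_len, candidates, ratio, tolerance=0.3):
    # keep candidates whose length lies within +/- tolerance of the expected length
    expected = source_len * ratio
    return [doc for doc in candidates
            if abs(len(doc.split()) - expected) <= tolerance * expected]

ratio = estimate_length_ratio([(100, 110), (80, 92), (120, 130)])
docs = ["short text", " ".join(["word"] * 108), " ".join(["word"] * 300)]
print(len(length_filter(100, docs, ratio)))   # only the ~108-word candidate survives
```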
{"title":"TQDL: Integrated Models for Cross-Language Document Retrieval","authors":"Longyue Wang, Derek F. Wong, Lidia S. Chao","doi":"10.30019/IJCLCLP.201212.0002","DOIUrl":"https://doi.org/10.30019/IJCLCLP.201212.0002","url":null,"abstract":"This paper proposed an integrated approach for Cross-Language Information Retrieval (CLIR), which integrated with four statistical models: Translation model, Query generation model, Document retrieval model and Length Filter model. Given a certain document in the source language, it will be translated into the target language of the statistical machine translation model. The query generation model then selects the most relevant words in the translated version of the document as a query. Instead of retrieving all the target documents with the query, the length-based model can help to filter out a large amount of irrelevant candidates according to their length information. Finally, the left documents in the target language are scored by the document searching model, which mainly computes the similarities between query and document.Different from the traditional parallel corpora-based model which relies on IBM algorithm, we divided our CLIR model into four independent parts but all work together to deal with the term disambiguation, query generation and document retrieval. Besides, the TQDL method can efficiently solve the problem of translation ambiguity and query expansion for disambiguation, which are the big issues in Cross-Language Information Retrieval. Another contribution is the length filter, which are trained from a parallel corpus according to the ratio of length between two languages. This can not only improve the recall value due to filtering out lots of useless documents dynamically, but also increase the efficiency in a smaller search space. Therefore, the precision can be improved but not at the cost of recall.In order to evaluate the retrieval performance of the proposed model on cross-languages document retrieval, a number of experiments have been conducted on different settings. Firstly, the Europarl corpus which is the collection of parallel texts in 11 languages from the proceedings of the European Parliament was used for evaluation. And we tested the models extensively to the case that: the lengths of texts are uneven and some of them may have similar contents under the same topic, because it is hard to be distinguished and make full use of the resources.After comparing different strategies, the experimental results show a significant performance of the method. The precision is normally above 90% by using a larger query size. The length-based filter plays a very important role in improving the F-measure and optimizing efficiency.This fully illustrates the discrimination power of the proposed method. It is of a great significance to both cross-language searching on the Internet and the parallel corpus producing for statistical machine translation systems. In the future work, the TQDL system will be evaluated for Chinese language, which is a big changing and more meaningful to CLIR.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. 
Process.","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133274913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-12-01 | DOI: 10.30019/IJCLCLP.201212.0001
Detecting and Correcting Syntactic Errors in Machine Translation Using Feature-Based Lexicalized Tree Adjoining Grammars
Wei-Yun Ma, K. McKeown
Statistical machine translation has made tremendous progress over the past ten years. The output of even the best systems, however, is often ungrammatical because of the lack of sufficient linguistic knowledge. Even when systems incorporate syntax in the translation process, syntactic errors still result. To address this issue, we present a novel approach for detecting and correcting ungrammatical translations. In order to simultaneously detect multiple errors and their corresponding words in a formal framework, we use feature-based lexicalized tree adjoining grammars, where each lexical item is associated with a syntactic elementary tree in which each node carries a set of feature-value pairs defining the lexical item's syntactic usage. Our syntactic error detection works by checking the feature values of all lexical items within a sentence using a unification framework. To simultaneously detect multiple error types and track their corresponding words, we propose a new unification method that allows the unification procedure to continue when unification fails and propagates the failure information to the relevant words. Once error types and their corresponding words are detected, errors can be corrected based on a unified consideration of all related words under the same error types. In this paper, we present simple mechanisms to handle some of the detected cases. We use our approach to detect and correct the translations of six individual statistical machine translation systems. The results show that most of the corrected translations are improved.
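As a toy illustration of the detection idea (the feature structures, words, and error labels below are invented, and real FB-LTAG feature structures are far richer), the sketch unifies feature-value sets and records clashes instead of aborting, so several error types and the words involved can be reported at once.

```python
# Toy sketch of failure-tracking unification: unify feature-value sets
# attached to two words and, instead of aborting on a clash, record the
# clashing feature together with the words involved, then keep going so
# multiple error types can be reported for the same sentence.
def unify_and_track(word_a, feats_a, word_b, feats_b):
    errors = []
    merged = dict(feats_a)
    for feat, value in feats_b.items():
        if feat in merged and merged[feat] != value:
            errors.append((feat, word_a, word_b))   # record the clash and continue
        else:
            merged[feat] = value
    return merged, errors

subj = ("dogs", {"num": "plural", "pers": "3"})
verb = ("barks", {"num": "singular", "pers": "3"})
merged, errors = unify_and_track(*subj, *verb)
print(errors)   # [('num', 'dogs', 'barks')] -> a number-agreement error
```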
{"title":"Detecting and Correcting Syntactic Errors in Machine Translation Using Feature-Based Lexicalized Tree Adjoining Grammars","authors":"Wei-Yun Ma, K. McKeown","doi":"10.30019/IJCLCLP.201212.0001","DOIUrl":"https://doi.org/10.30019/IJCLCLP.201212.0001","url":null,"abstract":"Statistical machine translation has made tremendous progress over the past ten years. The output of even the best systems, however, is often ungrammatical because of the lack of sufficient linguistic knowledge. Even when systems incorporate syntax in the translation process, syntactic errors still result. To address this issue, we present a novel approach for detecting and correcting ungrammatical translations. In order to simultaneously detect multiple errors and their corresponding words in a formal framework, we use feature-based lexicalized tree adjoining grammars, where each lexical item is associated with a syntactic elementary tree, in which each node is associated with a set of feature-value pairs to define the lexical item’s syntactic usage. Our syntactic error detection works by checking the feature values of all lexical items within a sentence using a unification framework. In order to simultaneously detect multiple error types and track their corresponding words, we propose a new unification method which allows the unification procedure to continue when unification fails and also to propagate the failure information to relevant words. Once error types and their corresponding words are detected, one is able to correct errors based on a unified consideration of all related words under the same error types. In this paper, we present some simple mechanism to handle part of the detected situations. We use our approach to detect and correct translations of six single statistical machine translation systems. The results show that most of the corrected translations are improved.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127184175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-09-01 | DOI: 10.30019/IJCLCLP.201209.0005
Evaluation of TTS Systems in Intelligibility and Comprehension Tasks: a Case Study of HTS-2008 and Multisyn Synthesizers
Yu-Yun Chang
This paper explores the relationship between intelligibility and comprehensibility in speech synthesizers and designs an appropriate comprehension task for evaluating synthesizers' comprehensibility. Previous studies have predicted that a speech synthesizer with higher intelligibility will also perform better in comprehension. Since the two most popular speech synthesis methods are HMM-based synthesis and unit selection, this study also compares whether the HTS-2008 (HMM-based) or the Multisyn (unit selection) synthesizer performs better in application. Natural speech is included in the experiment as a control condition for the synthesizers. The results of the intelligibility test show that natural speech is better than HTS-2008, which, in turn, is much better than the Multisyn system. In the comprehension task, however, the three speech systems show only minimal differences, because both synthesizers have reached the threshold of intelligibility needed to support high-quality speech comprehension. Therefore, although the HTS-2008 and Multisyn systems provide equally comprehensible speech, the HTS-2008 synthesizer is recommended due to its higher intelligibility.
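For readers unfamiliar with how such listening tests are scored, the sketch below computes a simple word-level intelligibility score, the proportion of reference words a listener transcribes correctly. This is one common scoring scheme, not necessarily the exact metric used in the study, and the stimuli are invented.

```python
# Toy intelligibility scoring: proportion of reference words transcribed
# correctly, averaged over stimuli. The position-by-position comparison is a
# simplification; real evaluations usually align with an edit distance (WER).
def word_accuracy(reference, transcription):
    ref = reference.lower().split()
    hyp = transcription.lower().split()
    correct = sum(1 for r, h in zip(ref, hyp) if r == h)
    return correct / len(ref)

pairs = [("the cat sat on the mat", "the cat sat on a mat"),
         ("please call stella", "please call stella")]
scores = [word_accuracy(ref, hyp) for ref, hyp in pairs]
print(sum(scores) / len(scores))   # mean intelligibility over the stimuli
```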
{"title":"Evaluation of TTS Systems in Intelligibility and Comprehension Tasks: a Case Study of HTS-2008 and Multisyn Synthesizers","authors":"Yu-Yun Chang","doi":"10.30019/IJCLCLP.201209.0005","DOIUrl":"https://doi.org/10.30019/IJCLCLP.201209.0005","url":null,"abstract":"This paper explores the relationship between intelligibility and comprehensibility in speech synthesizers, and it designs an appropriate comprehension task for evaluating the speech synthesizers' comprehensibility. Previous studies have predicted that a speech synthesizer with higher intelligibility will have higher performance in comprehension. Also, since the two most popular speech synthesis methods are HMM-based and unit selection, this study tries to compare whether the HTS-2008 (HMM-based) or Multisyn (unit selection) speech synthesizer has better performance in application. Natural speech is applied in the experiment as a control group to the speech synthesizers. The results in the intelligibility test show that natural speech is better than HTS-2008, which, in turn, is much better than the Multisyn system. In the comprehension task, however, all three of the speech systems display minimal differences in the speech comprehension process. This is because the two speech synthesizers have reached the threshold of having enough intelligibility to provide high speech comprehension quality. Therefore, although there is equal comprehensible speech quality between the HTS-2008 and Multisyn systems, the HTS-2008 speech synthesizer is recommended due to its higher intelligibility.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"125 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114025085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-09-01 | DOI: 10.30019/IJCLCLP.201209.0004
Strategies of Processing Japanese Names and Character Variants in Traditional Chinese Text
Chuan-Jie Lin, Jia-Cheng Zhan, Yen-Heng Chen, Chien-Wei Pao
This paper proposes an approach to identify word candidates that are not Traditional Chinese, including Japanese names (written in Japanese Kanji or Traditional Chinese characters) and word variants, when doing word segmentation on Traditional Chinese text. When handling personal names, a probability model concerning formats of names is introduced. We also propose a method to map Japanese Kanji into the corresponding Traditional Chinese characters. The same method can also be used to detect words written in character variants. After integrating generation rules for various types of special words, as well as their probability models, the F-measure of our word segmentation system rises from 94.16% to 96.06%. Another experiment shows that 83.18% of the 862 Japanese names in a set of 109 human-annotated documents can be successfully detected.
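The character-mapping step can be illustrated with a small sketch. The two-entry table below is only an example; the paper's actual mapping between Japanese Kanji (and character variants) and Traditional Chinese characters is much larger and is built by the method described above.

```python
# Sketch of the character-mapping idea: Japanese kanji (or variant characters)
# are mapped to their Traditional Chinese counterparts so that candidate words
# can be matched against a Traditional Chinese lexicon. The mapping table here
# contains only two example entries.
KANJI_TO_TRAD = {
    "沢": "澤",   # Japanese shinjitai form -> Traditional Chinese form
    "広": "廣",
}

def normalize(word):
    return "".join(KANJI_TO_TRAD.get(ch, ch) for ch in word)

print(normalize("広沢"))   # -> "廣澤"
```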
{"title":"Strategies of Processing Japanese Names and Character Variants in Traditional Chinese Text","authors":"Chuan-Jie Lin, Jia-Cheng Zhan, Yen-Heng Chen, Chien-Wei Pao","doi":"10.30019/IJCLCLP.201209.0004","DOIUrl":"https://doi.org/10.30019/IJCLCLP.201209.0004","url":null,"abstract":"This paper proposes an approach to identify word candidates that are not Traditional Chinese, including Japanese names (written in Japanese Kanji or Traditional Chinese characters) and word variants, when doing word segmentation on Traditional Chinese text. When handling personal names, a probability model concerning formats of names is introduced. We also propose a method to map Japanese Kanji into the corresponding Traditional Chinese characters. The same method can also be used to detect words written in character variants. After integrating generation rules for various types of special words, as well as their probability models, the F-measure of our word segmentation system rises from 94.16% to 96.06%. Another experiment shows that 83.18% of the 862 Japanese names in a set of 109 human-annotated documents can be successfully detected.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121420475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-09-01 | DOI: 10.30019/IJCLCLP.201209.0001
Effects of Combining Bilingual and Collocational Information on Translation of English and Chinese Verb-Noun Pairs
Yi-Hsuan Chuang, Chao-Lin Liu, Jing-Shin Chang
We studied a special case of the translation of English verbs in verb-object pairs. Researchers have studied the effects of the linguistic information of the verbs being translated, and many have reported how considering the objects of the verbs facilitates translation quality. In this study, we took an extreme approach, assuming the availability of the Chinese translation of the English object. In a related exploration, we examined with analogous procedures how the availability of the Chinese translation of the English verb influences the translation quality of the English noun in a verb phrase. We explored the issue with 35 thousand VN pairs extracted from the training data of the 2011 NTCIR PatentMT workshop and with 4.8 thousand VN pairs extracted from a bilingual edition of Scientific American magazine. The results indicated that, when the English verbs and objects were known, the additional information about the Chinese translations of the English verbs (or nouns) could improve the translation quality of the English nouns (or verbs), but not significantly. Further experiments compared the quality of translation achieved by our programs and by human subjects. Given the same set of information for translation decisions, human subjects did not outperform our programs, reconfirming that good translations depend heavily on contextual information of wider scope.
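A minimal sketch of the setting studied above, with invented candidate lists and collocation counts, might look like the following: given the Chinese translation of the English object, a Chinese translation of the English verb is chosen by verb-object collocation strength.

```python
# Toy illustration (invented data, not the paper's resources): pick a Chinese
# translation for an English verb given the known Chinese translation of its
# object, scoring verb candidates by verb-object collocation counts.
CANDIDATES = {"play": ["打", "演奏", "播放"]}               # possible verb translations
COLLOCATION = {("打", "籃球"): 52, ("演奏", "籃球"): 0, ("播放", "籃球"): 1}

def translate_verb(en_verb, zh_object):
    return max(CANDIDATES[en_verb],
               key=lambda v: COLLOCATION.get((v, zh_object), 0))

print(translate_verb("play", "籃球"))   # -> "打", as in 打籃球 ("play basketball")
```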
{"title":"Effects of Combining Bilingual and Collocational Information on Translation of English and Chinese Verb-Noun Pairs","authors":"Yi-Hsuan Chuang, Chao-Lin Liu, Jing-Shin Chang","doi":"10.30019/IJCLCLP.201209.0001","DOIUrl":"https://doi.org/10.30019/IJCLCLP.201209.0001","url":null,"abstract":"We studied a special case of the translation of English verbs in verb-object pairs. Researchers have studied the effects of the linguistic information of the verbs being translated, and many have reported how considering the objects of the verbs will facilitate the quality of translation. In this study, we took an extreme approach-assuming the availability of the Chinese translation of the English object. In a related exploration, we examined how the availability of the Chinese translation of the English verb influences the translation quality of the English nouns in verb phrases with analogous procedures. We explored the issue with 35 thousand VN pairs that we extracted from the training data obtained from the 2011 NTCIR PatentMT workshop and with 4.8 thousand VN pairs that we extracted from a bilingual version of Scientific American magazine. The results indicated that, when the English verbs and objects were known, the additional information about the Chinese translations of the English verbs (or nouns) could improve the translation quality of the English nouns (or verbs) but not significantly. Further experiments were conducted to compare the quality of translation achieved by our programs and by human subjects. Given the same set of information for translation decisions, human subjects did not outperform our programs, reconfirming that good translations depend heavily on contextual information of wider ranges.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117283076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-09-01 | DOI: 10.30019/IJCLCLP.201209.0003
Enhancement of Feature Engineering for Conditional Random Field Learning in Chinese Word Segmentation Using Unlabeled Data
Mike Tian-Jian Jiang, Cheng-Wei Shih, Ting-Hao Yang, Chan-Hung Kuo, Richard Tzong-Han Tsai, W. Hsu
This work proposes a unified view of several features based on frequent strings extracted from unlabeled data that improve the conditional random field (CRF) model for Chinese word segmentation (CWS). These features include character-based n-grams (CNG), accessor-variety-based strings (AVS) and their left-right co-existence variant (LRAVS), term-contributed frequency (TCF), and term-contributed boundary (TCB) with a specific treatment of overlapping boundaries. In the experiments, the baseline is the 6-tag scheme, a state-of-the-art labeling scheme for CRF-based CWS, and the data sets are taken from the 2005 CWS Bakeoff of the Special Interest Group on Chinese Language Processing (SIGHAN) of the Association for Computational Linguistics (ACL) and from the SIGHAN CWS Bakeoff 2010. The experimental results show that all of these features improve the performance of the baseline system in terms of recall, precision, and their harmonic mean (the F1 measure), for both overall accuracy (F) and out-of-vocabulary recognition (F_OOV). In particular, this work presents compound features involving LRAVS/AVS and TCF/TCB that are competitive with other types of features for CRF-based CWS in terms of F and F_OOV, respectively.
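One of the listed features, accessor variety, is easy to sketch: it counts how many distinct characters appear immediately to the left and to the right of a candidate string in unlabeled text. The toy corpus below is invented, and this shows only the feature computation, not the CRF pipeline.

```python
# Sketch of the accessor-variety (AV) feature: for a candidate string, count
# the distinct characters occurring immediately before and after it in
# unlabeled text. Strings with high AV on both sides are more likely to be
# words. Simplified illustration only; not the paper's feature extractor.
def accessor_variety(text, s):
    left, right = set(), set()
    start = text.find(s)
    while start != -1:
        if start > 0:
            left.add(text[start - 1])
        end = start + len(s)
        if end < len(text):
            right.add(text[end])
        start = text.find(s, start + 1)
    return len(left), len(right)

corpus = "我爱北京，他到北京去，北京很大"
print(accessor_variety(corpus, "北京"))   # (left AV, right AV) of the candidate
```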
{"title":"Enhancement of Feature Engineering for Conditional Random Field Learning in Chinese Word Segmentation Using Unlabeled Data","authors":"Mike Tian-Jian Jiang, Cheng-Wei Shih, Ting-Hao Yang, Chan-Hung Kuo, Richard Tzong-Han Tsai, W. Hsu","doi":"10.30019/IJCLCLP.201209.0003","DOIUrl":"https://doi.org/10.30019/IJCLCLP.201209.0003","url":null,"abstract":"This work proposes a unified view of several features based on frequent strings extracted from unlabeled data that improve the conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based n-gram (CNG), accessor variety based string (AVS) and its variation of left-right co-existed feature (LRAVS), term-contributed frequency (TCF), and term-contributed boundary (TCB) with a specific manner of boundary overlapping. For the experiments, the baseline is the 6-tag, a state-of-the-art labeling scheme of CRF-based CWS, and the data set is acquired from the 2005 CWS Bakeoff of Special Interest Group on Chinese Language Processing (SIGHAN) of the Association for Computational Linguistics (ACL) and SIGHAN CWS Bakeoff 2010. The experimental results show that all of these features improve the performance of the baseline system in terms of recall, precision, and their harmonic average as F1 measure score, on both accuracy (F) and out-of-vocabulary recognition (FOOV). In particular, this work presents compound features involving LRAVS/AVS and TCF/TCB that are competitive with other types of features for CRF-based CWS in terms of F and FOOV, respectively.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127023670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-06-01 | DOI: 10.30019/IJCLCLP.201206.0003
Frequency, Collocation, and Statistical Modeling of Lexical Items: A Case Study of Temporal Expressions in Two Conversational Corpora
Sheng-Fu Wang, Jing-Chen Yang, Yu-Yun Chang, Yu-Wen Liu, S. Hsieh
This study examines how different dimensions of corpus frequency data may affect the outcome of statistical modeling of lexical items. Our analysis mainly focuses on a recently constructed elderly-speaker corpus that is used to reveal patterns in aging people's language use. A conversational corpus contributed by speakers in their 20s serves as complementary material. The target words examined are temporal expressions, which might reveal how the speech produced by the elderly is organized. We conduct divisive hierarchical clustering analyses based on two different dimensions of the corpus data, namely raw frequency distributions and collocation-based vectors. When different dimensions of the data were used as input, the target terms were clustered in different ways: analyses based on frequency distributions and on collocational patterns are distinct from each other. Specifically, statistically based collocational analysis generally produces more distinct clustering results that differentiate temporal terms more finely than analyses based on raw frequency.
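The contrast between the two input representations can be sketched as follows. The terms, counts, and context dimensions below are invented, and the actual study feeds such vectors into divisive hierarchical clustering rather than comparing pairwise cosines directly.

```python
# Sketch of the two representations contrasted above: the same temporal terms
# can be described by raw frequencies across (sub)corpora or by co-occurrence
# vectors over context words, and the two need not agree on which terms are
# similar. All numbers are invented for illustration.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# rows: temporal terms; columns: raw frequency in two hypothetical subcorpora
freq = {"以前 (before)": [120, 40], "現在 (now)": [200, 180], "後來 (later)": [60, 20]}
# rows: same terms; columns: co-occurrence counts with a few context words
colloc = {"以前 (before)": [30, 2, 15], "現在 (now)": [5, 40, 8], "後來 (later)": [25, 3, 12]}

for a, b in [("以前 (before)", "現在 (now)"), ("以前 (before)", "後來 (later)")]:
    print(a, "vs", b,
          "freq:", round(cosine(freq[a], freq[b]), 3),
          "colloc:", round(cosine(colloc[a], colloc[b]), 3))
```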
{"title":"Frequency, Collocation, and Statistical Modeling of Lexical Items: A Case Study of Temporal Expressions in Two Conversational Corpora","authors":"Sheng-Fu Wang, Jing-Chen Yang, Yu-Yun Chang, Yu-Wen Liu, S. Hsieh","doi":"10.30019/IJCLCLP.201206.0003","DOIUrl":"https://doi.org/10.30019/IJCLCLP.201206.0003","url":null,"abstract":"This study examines how different dimensions of corpus frequency data may affect the outcome of statistical modeling of lexical items. Our analysis mainly focuses on a recently constructed elderly speaker corpus that is used to reveal patterns of aging people's language use. A conversational corpus contributed by speakers in their 20s serves as complementary material. The target words examined are temporal expressions, which might reveal how the speech produced by the elderly is organized. We conduct divisive hierarchical clustering analyses based on two different dimensions of corporal data, namely raw frequency distribution and collocation-based vectors. When different dimensions of data were used as the input, results showed that the target terms were clustered in different ways. Analyses based on frequency distributions and collocational patterns are distinct from each other. Specifically, statistically-based collocational analysis generally produces more distinct clustering results that differentiate temporal terms more delicately than do the ones based on raw frequency.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122381551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-03-01 | DOI: 10.30019/IJCLCLP.201203.0001
Development and Testing of Transcription Software for a Southern Min Spoken Corpus
Jia-Cing Ruan, Chiung-Wen Hsu, J. Myers, Jane S. Tsay
The usual challenges of transcribing spoken language are compounded for Southern Min (Taiwanese) because it lacks a generally accepted orthography. This study reports the development and testing of software tools for assisting such transcription. Three tools are compared, each representing a different type of interface with our corpus-based Southern Min lexicon (Tsay, 2007): our original Chinese character-based tool (Segmentor), the first version of a romanization-based lexicon entry tool called Adult-Corpus Romanization Input Program (ACRIP 1.0), and a revised version of ACRIP that accepts both character and romanization inputs and integrates them with sound files (ACRIP 2.0). In two experiments, naive native speakers of Southern Min were asked to transcribe passages from our corpus of adult spoken Southern Min (Tsay and Myers, in progress), using one or more of these tools. Experiment 1 showed no disadvantage for romanization-based compared with character-based transcription even for untrained transcribers. Experiment 2 showed significant advantages of the new mixed-system tool (ACRIP 2.0) over both Segmentor and ACRIP 1.0, in both speed and accuracy of transcription. Experiment 2 also showed that only minimal additional training brought dramatic improvements in both speed and accuracy. These results suggest that the transcription of non-Mandarin Sinitic languages benefits from flexible, integrated software tools.
{"title":"Development and Testing of Transcription Software for a Southern Min Spoken Corpus","authors":"Jia-Cing Ruan, Chiung-Wen Hsu, J. Myers, Jane S. Tsay","doi":"10.30019/IJCLCLP.201203.0001","DOIUrl":"https://doi.org/10.30019/IJCLCLP.201203.0001","url":null,"abstract":"The usual challenges of transcribing spoken language are compounded for Southern Min (Taiwanese) because it lacks a generally accepted orthography. This study reports the development and testing of software tools for assisting such transcription. Three tools are compared, each representing a different type of interface with our corpus-based Southern Min lexicon (Tsay, 2007): our original Chinese character-based tool (Segmentor), the first version of a romanization-based lexicon entry tool called Adult-Corpus Romanization Input Program (ACRIP 1.0), and a revised version of ACRIP that accepts both character and romanization inputs and integrates them with sound files (ACRIP 2.0). In two experiments, naive native speakers of Southern Min were asked to transcribe passages from our corpus of adult spoken Southern Min (Tsay and Myers, in progress), using one or more of these tools. Experiment 1 showed no disadvantage for romanization-based compared with character-based transcription even for untrained transcribers. Experiment 2 showed significant advantages of the new mixed-system tool (ACRIP 2.0) over both Segmentor and ACRIP 1.0, in both speed and accuracy of transcription. Experiment 2 also showed that only minimal additional training brought dramatic improvements in both speed and accuracy. These results suggest that the transcription of non-Mandarin Sinitic languages benefits from flexible, integrated software tools.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132636449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2011-12-01 | DOI: 10.30019/IJCLCLP.201112.0003
Histogram Equalization on Statistical Approaches for Chinese Unknown Word Extraction
Bor-Shen Lin, Yi-Cong Chen
With the evolution of human life and the spread of information, new things emerge quickly and new terms are created every day, so it is important for natural language processing systems to extract new words as they appear over time. Because of the broad range of applications, however, there may be a mismatch in statistical characteristics between the training domain and the testing domain, which inevitably degrades the performance of word extraction. This paper proposes a word extraction scheme in which histogram equalization is used for feature normalization. Through this scheme, the mismatch between feature distributions caused by different corpus sizes or changes of domain can be compensated for appropriately, so that unknown word extraction becomes more reliable and applicable to new domains. The scheme was initially evaluated on the corpora announced in the second SIGHAN bakeoff. Using four combined features with equalization, F-measures of 68.43% and 71.40% for word identification were achieved on the CKIP and CUHK test sets, corresponding to IV/OOV recall rates of 66.72%/32.94% and 75.99%/58.39%, respectively. When applied to unknown word extraction in a new domain, the scheme can identify proper nouns and new terms such as "海角七號" (Cape No. 7, the name of a film), "蠟筆小新" (Crayon Shinchan, the name of a cartoon character), and "金融海嘯" (Financial Tsunami), which cannot be extracted reliably with rule-based approaches, although it appears less effective at identifying terms such as the names of people, places, or organizations, for which the semantic structure is prominent. The scheme is complementary to the outputs of two word segmentation systems and is promising if other rule-based approaches can be further integrated.
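Histogram equalization here means mapping a feature value through the empirical distribution of the new domain onto the corresponding quantile of the training-domain distribution. The sketch below shows one simple quantile-mapping realization of that idea with invented data; it is not the paper's implementation.

```python
# Sketch of histogram equalization as feature normalization: a value observed
# in the target domain is converted to its empirical quantile there and then
# mapped onto the same quantile of the reference (training-domain) sample.
# The toy samples are invented.
import bisect

def equalize(value, target_sample, reference_sample):
    target = sorted(target_sample)
    reference = sorted(reference_sample)
    # empirical CDF of the value within the target-domain sample
    rank = bisect.bisect_right(target, value) / len(target)
    # map that quantile back onto the reference distribution
    idx = min(int(rank * len(reference)), len(reference) - 1)
    return reference[idx]

reference_sample = [1, 2, 2, 3, 3, 3, 4, 5, 6, 9]          # training-domain feature values
target_sample = [10, 12, 15, 15, 18, 20, 22, 25, 30, 40]   # same feature, new domain
print(equalize(18, target_sample, reference_sample))        # mapped into the reference scale
```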
{"title":"Histogram Equalization on Statistical Approaches for Chinese Unknown Word Extraction","authors":"Bor-Shen Lin, Yi-Cong Chen","doi":"10.30019/IJCLCLP.201112.0003","DOIUrl":"https://doi.org/10.30019/IJCLCLP.201112.0003","url":null,"abstract":"With the evolution of human lives and the spread of information, new things emerge quickly and new terms are created every day. Therefore, it is important for natural language processing systems to extract new words in progression with time. Due to the broad areas of applications, however, there might exist the mismatch of statistical characteristics between the training domain and the testing domain, which inevitably degrades the performance of word extraction. This paper proposes a scheme of word extraction in which histogram equalization for feature normalization is used. Through this scheme, the mismatch of the feature distributions due to different corpus sizes or changes of domain can be compensated for appropriately such that unknown word extraction becomes more reliable and applicable to novice domains. The scheme was initially evaluated on the corpora announced in SIGHAN2. 68.43% and 71.40% F-measures for word identification, which correspond to 66.72%/32.94% and 75.99%/58.39% recall rates for IV/OOV, respectively, were achieved for the CKIP and the CUHK test sets, respectively, using four combined features with equalization. When applied to unknown word extraction for a novice domain, this scheme can identify such pronouns as ”海角七號” (Cape No. 7, the name of a film), ”蠟筆小新” (Crayon Shinchan, the name of a cartoon figure), ”金融海嘯” (Financial Tsunami) and so on, which cannot be extracted reliably with rule-based approaches, although the approach appears not so good at identifying such terms as the names of humans, places, or organizations, for which the semantic structure is prominent. This scheme is complementary with the outcomes of two word segmentation systems, and is promising if other rule-based approaches could be further integrated.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115192051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}