Pub Date: 1997-02-01  DOI: 10.30019/IJCLCLP.199702.0002
Chin-Chuan Cheng
This paper is a synthesis of past studies on the measurement of dialect relationships. The phonological data of 17 Chinese dialects, computerized in the late 1960s, have been utilized to measure dialect distance. In addition, a file of over 6,400 lexical variants in 18 dialects was used to quantify dialect affinity. This paper first explains the nature, organization, and coding of these files. A series of steps illustrates how the phonological file was processed to derive the information needed to calculate correlation coefficients. The coefficients are taken as indices of dialect affinity. The dialects are then grouped by the average-linkage method of cluster analysis applied to the coefficients. The appropriateness of the correlation method to the data is then discussed. Recent work on calculating dialect mutual intelligibility is presented to indicate the future direction of research.
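The correlate-then-cluster pipeline described above can be sketched in code. This is a hypothetical illustration, not the paper's actual procedure: dialects are represented here as made-up numeric feature vectors, Pearson correlation serves as the affinity index, and average-linkage agglomerative clustering merges the most similar groups first.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(names, vectors):
    """Agglomerative clustering: repeatedly merge the pair of clusters
    whose average pairwise correlation (affinity) is highest.

    Returns the merge history as (members_of_a, members_of_b, avg_affinity).
    """
    corr = [[pearson(a, b) for b in vectors] for a in vectors]
    clusters = [[i] for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                avg = (sum(corr[a][b] for a in clusters[i] for b in clusters[j])
                       / (len(clusters[i]) * len(clusters[j])))
                if best is None or avg > best[0]:
                    best = (avg, i, j)
        avg, i, j = best
        merges.append(([names[k] for k in clusters[i]],
                       [names[k] for k in clusters[j]], avg))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

Reading the merge history top-down gives the same grouping a dendrogram would: early merges are the most closely affiliated dialects.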
Title: Measuring Relationship among Dialects: DOC and Related Resources (Int. J. Comput. Linguistics Chin. Lang. Process.)
Pub Date: 1997-02-01  DOI: 10.30019/IJCLCLP.199702.0003
Hsiao-Chuan Wang
A cooperative project, called "Polyphone", was initiated by the Coordinating Committee on Speech Databases and Speech I/O Systems Assessment (COCOSDA) in 1992. Accordingly, a project to collect Mandarin speech data across Taiwan (MAT) was conducted by a group of researchers from several universities and research organizations in Taiwan. The purpose was to generate a speech corpus for the development of Mandarin-based speech technology and products. The speech data were collected at eight recording stations through telephone networks. The speakers were chosen to reflect the population of Taiwan in terms of gender, dialect, educational level, and residence. A preliminary Mandarin speech database of 800 speakers has been produced. The final goal is to generate a speech database of at least 5,000 speakers.
Title: MAT - A Project to Collect Mandarin Speech Data Through Telephone Networks in Taiwan (Int. J. Comput. Linguistics Chin. Lang. Process.)
Pub Date: 1996-08-01  DOI: 10.30019/IJCLCLP.199608.0006
Keh-Jiann Chen
The Chinese language has many special characteristics that differ substantially from those of Western languages, causing conventional language-processing methods to fail on Chinese. For example, Chinese sentences are composed of strings of characters without word boundaries marked by spaces. Therefore, word segmentation and unknown-word identification techniques must be used to identify words in Chinese. In addition, Chinese has very few inflectional or grammatical markers, making purely syntactic approaches to parsing almost impossible. Hence, a unified approach involving both syntactic and semantic information must be used. To this end, a lexical feature-based grammar formalism, called Information-based Case Grammar, is adopted for the parsing model proposed here. This formalism stipulates that a lexical entry for a word contains both semantic and syntactic feature structures. By relaxing the constraints on lexical feature structures, even ill-formed input can be accepted, broadening the coverage of the grammar. A priority-controlled chart parser is proposed which, in conjunction with a mechanism of dynamic grammar extension, addresses the problems of (1) syntactic ambiguities, (2) under-specification and limited coverage of grammars, and (3) ill-formed sentences. The model does this without slowing the parsing of sentences that require neither constraint relaxation nor dynamic grammar extension.
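The constraint-relaxation idea can be made concrete with a toy sketch. This is an invented illustration, not the paper's Information-based Case Grammar: feature structures are flattened to plain dictionaries, and a clash on a feature declared relaxable is dropped at a penalty instead of failing outright.

```python
def unify(f1, f2, relaxable=frozenset()):
    """Unify two flat feature structures (dicts).

    Returns (result, penalty), where penalty counts relaxed clashes,
    or (None, None) when a non-relaxable feature clashes.
    """
    result = dict(f1)
    penalty = 0
    for feat, val in f2.items():
        if feat in result and result[feat] != val:
            if feat in relaxable:
                del result[feat]  # drop the clashing constraint
                penalty += 1
            else:
                return None, None  # hard failure: input rejected
        else:
            result[feat] = val
    return result, penalty
```

A parser built in this spirit could try strict unification first and retry with a relaxable feature set only when no analysis survives, charging the accumulated penalty against the parse's priority, so well-formed sentences pay no extra cost.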
Title: A Model for Robust Chinese Parser (Int. J. Comput. Linguistics Chin. Lang. Process.)
Pub Date: 1996-08-01  DOI: 10.30019/IJCLCLP.199608.0004
Keh-Yih Su, Tung-Hui Chiang, Jing-Shin Chang
A Corpus-Based Statistics-Oriented (CBSO) methodology, which attempts to avoid the drawbacks of both traditional rule-based approaches and purely statistical approaches, is introduced in this paper. Rule-based approaches, with rules induced by human experts, were long the dominant paradigm in the natural language processing community. Such approaches, however, suffer from serious difficulties in knowledge acquisition in terms of cost and consistency, making such systems very difficult to scale up. Statistical methods, which can automatically acquire knowledge from corpora, are becoming more and more popular, in part to remedy the shortcomings of rule-based approaches. However, most simple statistical models, which adopt almost nothing from existing linguistic knowledge, often result in a large parameter space and thus require an unaffordably large training corpus even for well-justified linguistic phenomena. The CBSO approach is a compromise between these two extremes of the knowledge-acquisition spectrum. It emphasizes the use of well-justified linguistic knowledge in developing the underlying language model, and the application of statistical optimization techniques on top of high-level constructs, such as annotated syntax trees, rather than on surface strings, so that only a training corpus of reasonable size is needed and long-distance dependencies between constituents can be handled. In this paper, corpus-based statistics-oriented techniques are reviewed, and general techniques applicable to CBSO approaches are introduced. In particular, we address the following important issues: (1) general tasks in developing an NLP system; (2) why CBSO is the preferred choice among different strategies; (3) how to achieve good performance systematically using a CBSO approach; and (4) frequently used CBSO techniques. Several examples are also reviewed.
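The parameter-space argument can be made concrete with a toy example (the corpus below is invented for illustration): a bigram model over surface words needs on the order of |V|² parameters, while the same model over a small annotated tag set needs only |T|², which is why statistics estimated over high-level constructs can be trained on far less data.

```python
from collections import Counter

# Tiny hand-annotated corpus (hypothetical data, illustration only).
tagged = [
    [("the", "DET"), ("dog", "N"), ("runs", "V")],
    [("a", "DET"), ("cat", "N"), ("sleeps", "V")],
]

vocab = {w for sent in tagged for w, _ in sent}
tagset = {t for sent in tagged for _, t in sent}
word_params = len(vocab) ** 2   # word-bigram table: |V|**2 entries
tag_params = len(tagset) ** 2   # tag-bigram table:  |T|**2 entries

# Maximum-likelihood tag-bigram probabilities from the annotations.
bigrams = Counter((s[i][1], s[i + 1][1])
                  for s in tagged for i in range(len(s) - 1))
unigrams = Counter(t for s in tagged for _, t in s)
prob = {(a, b): c / unigrams[a] for (a, b), c in bigrams.items()}
```

Even in this six-word toy, the tag-level model has a quarter as many parameters as the word-level one; with a realistic vocabulary the gap grows quadratically.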
Title: An Overview of Corpus-Based Statistics-Oriented (CBSO) Techniques for Natural Language Processing (Int. J. Comput. Linguistics Chin. Lang. Process.)
Pub Date: 1996-08-01  DOI: 10.30019/IJCLCLP.199608.0005
Kuang-hua Chen, Hsin-Hsi Chen
It is difficult for purely statistics-based machine translation systems to process long sentences, and domain dependence is a key issue under such a framework. Purely rule-based machine translation systems incur high human costs in formulating rules and introduce inconsistencies as the number of rules increases. Integrating the two approaches reduces the difficulties associated with both. In this paper, an integrated model for a machine translation system is proposed. A partial parsing method is adopted, and translation is performed chunk by chunk. In the synthesis module, word order is locally rearranged within chunks via a Markov model. Since a chunk is much shorter than a sentence, the disadvantage of the Markov model in dealing with long-distance phenomena is greatly reduced. Structural transfer is carried out using a set of rules; lexical transfer, in contrast, is resolved using bilingual constraints. Qualitative and quantitative knowledge are employed interleavingly and cooperatively, so that the advantages of both approaches are retained.
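The local-reordering step can be sketched as follows. This is a hypothetical stand-in for the synthesis module described above, not the paper's implementation: every permutation of a chunk's word list is scored under a bigram model and the best one is kept; exhaustive search is affordable precisely because chunks are short.

```python
from itertools import permutations
from math import log

def best_order(chunk, bigram_prob, floor=1e-6):
    """Return the permutation of `chunk` with the highest bigram score.

    `bigram_prob` maps (word, word) pairs to probabilities; unseen
    pairs are smoothed to a small floor value. Feasible only for
    short chunks, since all len(chunk)! orders are scored.
    """
    def score(seq):
        return sum(log(bigram_prob.get(pair, floor))
                   for pair in zip(seq, seq[1:]))
    return list(max(permutations(chunk), key=score))
```

In a full system the bigram table would be estimated from a target-language corpus; here it is supplied by the caller, and the exhaustive search could be replaced by a Viterbi-style pass if chunks grew longer.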
Title: A Hybrid Approach to Machine Translation System Design (Int. J. Comput. Linguistics Chin. Lang. Process.)
Pub Date: 1996-08-01  DOI: 10.30019/IJCLCLP.199608.0001
Chin-Hui Lee, B. Juang
For the past two decades, research in speech recognition has been carried out intensively worldwide, spurred on by advances in signal processing, algorithms, architectures, and hardware. Speech recognition systems have been developed for a wide variety of applications, ranging from small-vocabulary keyword recognition over dial-up telephone lines, to medium-vocabulary voice-interactive command and control systems on personal computers, to large-vocabulary speech dictation, spontaneous speech understanding, and limited-domain speech translation. In this paper we review some of the key advances in several areas of automatic speech recognition. We also illustrate, by examples, how these key advances can be used for continuous speech recognition of Mandarin. Finally, we elaborate on the requirements for designing successful real-world applications and address technical challenges that must be met in order to reach the ultimate goal of providing an easy-to-use, natural, and flexible voice interface between people and machines.
Title: A Survey on Automatic Speech Recognition with an Illustrative Example on Continuous Speech Recognition of Mandarin (Int. J. Comput. Linguistics Chin. Lang. Process.)
Pub Date: 1996-08-01  DOI: 10.30019/IJCLCLP.199608.0007
Lee-Feng Chien, H. Pu
In this paper, we emphasize the significance of Chinese information retrieval in the age of the Internet and raise several important research issues that are fundamental and require further investigation. At the same time, we point out problems and requirements that have often been neglected in designing general Chinese IR systems. Furthermore, experiences obtained from the design of the Csmart system are also described.
Title: Important Issues on Chinese Information Retrieval (Int. J. Comput. Linguistics Chin. Lang. Process.)