Code-switching is the mixing of multiple languages within a text, and it is increasingly common on social media. Code-mixed texts are usually written in a single script, even though the languages involved have different native scripts. Pre-trained multilingual models are trained primarily on data in each language's native script, yet existing studies use code-switched texts as they are. Converting each language to its native script can yield better text representations because it draws on this pre-trained knowledge. This study therefore proposes a cross-language-script knowledge-sharing architecture that uses cross-attention and alignment between the representations of the text in each language's script. Experimental results on two datasets of Nepali-English and Hindi-English code-switched texts demonstrate the effectiveness of the proposed method. Interpreting the model with a model explainability technique illustrates how language-specific knowledge is shared between the language-specific representations.
{"title":"Share What You Already Know: Cross-Language-Script Transfer and Alignment for Sentiment Detection in Code-Mixed Data","authors":"Niraj Pahari, Kazutaka Shimada","doi":"10.1145/3661307","DOIUrl":"https://doi.org/10.1145/3661307","url":null,"PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"3 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140799398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
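The cross-attention idea mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' architecture: it only shows generic scaled dot-product attention in which token vectors from one script (queries) attend over token vectors of the same text in another script (keys and values). The function names and the toy 2-dimensional vectors are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query vector (one script)
    is replaced by a weighted sum of value vectors (the other script)."""
    d = len(keys[0])
    output = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        output.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return output


# Toy token vectors (made-up numbers, not model outputs).
roman_tokens = [[1.0, 0.0], [0.0, 1.0]]
native_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = cross_attention(roman_tokens, native_tokens, native_tokens)
```

Each row of `attended` has the same length as a value vector, so the Roman-script tokens end up carrying information drawn from the native-script representation.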
Humour is a crucial aspect of human speech, so systems that can detect it are needed. While data on humour in English speech is plentiful, the same cannot be said for a low-resource language like Hindi. In this paper, we introduce two multimodal datasets for humour detection in Hindi web series. The datasets were collected from over 500 minutes of conversations among the characters of the Hindi web series Kota-Factory and Panchayat. Each dialogue is manually annotated as Humour or Non-Humour. Along with the new Hindi humour detection datasets, we propose an improved framework for detecting humour in Hindi conversations. We first preprocess both datasets to obtain uniformity across dialogues and datasets. The processed dialogues are then passed through a Skip-gram model to generate Hindi word embeddings. The embeddings are fed simultaneously to three convolutional neural network (CNN) architectures, each with a different filter size, for feature extraction. The extracted features are then passed through stacked Long Short-Term Memory (LSTM) layers, which finally classify each dialogue as Humour or Non-Humour. We conduct extensive experiments on both datasets and report several standard performance metrics. We also compare the proposed framework with several baselines and contemporary humour detection algorithms. The results support the use of our datasets as standard datasets for humour detection in Hindi web series. The proposed model achieves accuracies of 91.79% and 87.32% and F1 scores of 91.64% and 87.04% on the Kota-Factory and Panchayat datasets, respectively.
{"title":"HumourHindiNet: Humour detection in Hindi web series using word embedding and convolutional neural network","authors":"Akshi Kumar, Abhishek Mallik, Sanjay Kumar","doi":"10.1145/3661306","DOIUrl":"https://doi.org/10.1145/3661306","url":null,"PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"7 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140799400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
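The Skip-gram step named in the abstract above learns embeddings from (center word, context word) pairs. As a rough illustration only, the sketch below shows how such training pairs are typically generated from a tokenized dialogue. The window size, the function name `skipgram_pairs`, and the romanized placeholder tokens are assumptions, not details taken from the paper.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) training pairs for a Skip-gram model.

    Every token is paired with each neighbour at most `window`
    positions to its left or right.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical, romanized placeholder dialogue (not from the datasets).
dialogue = ["yeh", "bahut", "mazedaar", "hai"]
pairs = skipgram_pairs(dialogue, window=1)
```

A Skip-gram model is then trained to predict the second element of each pair from the first, and the learned input weights serve as the word embeddings.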
Algorithms based on Newton's second law of motion are central to interactive visual effects and interactive behavior in interface design. Because designers often lack systematic training in mathematics and, especially, programming, they can use only simple algorithm templates in interface design. Using the Newton's-second-law algorithm directly introduces two interface design problems. First, the resulting visuals are simplistic, interaction is laborious, there are too few interactive elements, and the visual effects are dull. Second, applying the algorithm directly to interface design reduces creativity and originality and encourages cognitive inertia. This study proposes a modification of the Newton's-second-law algorithm. It offers a new idea for applying the algorithm and a design strategy based on algorithm modification that enables new kinds of interface design. Algorithm design gives interface design a new perspective and improves content production. Introducing repulsive force, reset force, and the shape, color, and other attributes of interactive objects into the computation, and integrating other algorithms to transform its basic logic, improves the visual effect of interaction design. It also improves users' interaction experience, their emotional response, and their desire to engage with the design work.
{"title":"An Interaction-Design Method Based upon a Modified Algorithm of Newton's Second Law of Motion","authors":"Qiao Feng, Tian Huang","doi":"10.1145/3657634","DOIUrl":"https://doi.org/10.1145/3657634","url":null,"PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"213 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140625576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
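To make the idea in the abstract above concrete, here is a minimal sketch of one animation step driven by Newton's second law (a = F / m) with a repulsive force and a reset force. The abstract does not give the force formulas, so the inverse-square repulsion from the pointer, the spring-like reset toward a home position, the damping factor, and all names and constants are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Particle:
    x: float
    y: float
    home_x: float  # rest position the reset force pulls toward
    home_y: float
    vx: float = 0.0
    vy: float = 0.0
    mass: float = 1.0


def step(p, pointer, dt=0.016, repel=50.0, spring=4.0, damping=0.9):
    """Advance one particle by one frame using a = F / m."""
    # Repulsive force away from the pointer (assumed inverse-square form).
    dx, dy = p.x - pointer[0], p.y - pointer[1]
    dist_sq = dx * dx + dy * dy + 1e-6  # small offset avoids division by zero
    dist = dist_sq ** 0.5
    fx = repel * dx / (dist_sq * dist)
    fy = repel * dy / (dist_sq * dist)
    # Reset force pulling the particle back home (assumed linear spring).
    fx += spring * (p.home_x - p.x)
    fy += spring * (p.home_y - p.y)
    # Newton's second law, integrated with damped semi-implicit Euler.
    p.vx = (p.vx + fx / p.mass * dt) * damping
    p.vy = (p.vy + fy / p.mass * dt) * damping
    p.x += p.vx * dt
    p.y += p.vy * dt
    return p


# A particle resting at its home position is pushed away by a nearby pointer.
pushed = step(Particle(x=0.0, y=0.0, home_x=0.0, home_y=0.0), pointer=(-1.0, 0.0))
```

Calling `step` once per frame for every interactive object gives the familiar "scatter and return" effect; attributes such as shape or color could be driven from the same state, for example from the particle's distance to its home position.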
Aspect-based Sentiment Analysis (ABSA), also known as fine-grained sentiment analysis, aims to predict the sentiment polarity of specific aspect words in a sentence. Some studies have explored the semantic correlation between words in a sentence through attention-based methods. Other studies have learned syntactic knowledge by using graph convolution networks to introduce dependency relations. These methods have achieved satisfactory results on ABSA tasks. However, due to the complexity of language, effectively capturing semantic and syntactic knowledge remains a challenging research question. We therefore propose an Adaptive Dual Graph Convolution Fusion Network (AD-GCFN) for aspect-based sentiment analysis. The model uses two graph convolution networks: one for the semantic layer, which learns semantic correlations through an attention mechanism, and one for the syntactic layer, which learns syntactic structure through dependency parsing. To reduce the noise introduced by the attention mechanism, we design a module that dynamically updates the graph structure information so that node information is aggregated adaptively. To fuse semantic and syntactic information effectively, we propose a cross-fusion module that uses the double random similarity matrix to obtain the syntactic features in the semantic space and the semantic features in the syntactic space, respectively. Additionally, we employ two regularizers to further improve the ability to capture semantic correlations. The orthogonal regularizer encourages the semantic layer to learn word semantics without overlap, while the differential regularizer encourages the semantic and syntactic layers to learn different parts. Finally, experimental results on three benchmark datasets show that the AD-GCFN model outperforms the baseline models in terms of accuracy and macro-F1.
{"title":"An adaptive Dual Graph Convolution Fusion Network for Aspect-Based Sentiment Analysis","authors":"Chunmei Wang, Yuan Luo, Chunli Meng, Feiniu Yuan","doi":"10.1145/3659579","DOIUrl":"https://doi.org/10.1145/3659579","url":null,"PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"33 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140611097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
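The abstract above does not state the formula for its orthogonal regularizer. One common way to express "attention rows should not overlap" is to penalize the squared Frobenius norm of A·Aᵀ − I for an attention matrix A; the sketch below computes that quantity. Treat it as an assumed, generic form rather than the AD-GCFN definition, and the function name as hypothetical.

```python
def orthogonal_penalty(attn):
    """Squared Frobenius norm of (A @ A.T - I) for an attention matrix A.

    The penalty is zero when the rows of A are orthonormal and grows
    as different rows place weight on the same positions.
    """
    n = len(attn)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Entry (i, j) of A @ A.T is the dot product of rows i and j.
            dot = sum(a * b for a, b in zip(attn[i], attn[j]))
            target = 1.0 if i == j else 0.0
            total += (dot - target) ** 2
    return total


focused = [[1.0, 0.0], [0.0, 1.0]]      # each word attends to a distinct position
overlapping = [[0.5, 0.5], [0.5, 0.5]]  # both words attend to the same positions

penalty_focused = orthogonal_penalty(focused)
penalty_overlapping = orthogonal_penalty(overlapping)
```

In training, a term like this would be added to the task loss with a small weight so that the semantic attention heads are nudged toward distinct patterns.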
Named Entity Recognition (NER) in low-resource settings aims to identify and categorize entities in a sentence with limited labeled data. Although prompt-based methods have succeeded in low-resource settings, challenges persist in effectively harnessing information and optimizing computational efficiency. In this work, we present a novel prompt-based method to enhance low-resource NER without exhaustive template tuning. First, we construct knowledge-enriched prompts by integrating representative entities and background information to provide informative supervision tailored to each entity type. Then, we introduce an efficient reverse generative framework inspired by question answering (QA), which avoids redundant computations. Finally, we reduce costs by generating entities from their types while retaining the model's reasoning capacity. Experimental results demonstrate that our method outperforms the baselines on three datasets under few-shot settings.
{"title":"Knowledge-Enriched Prompt for Low-Resource Named Entity Recognition","authors":"Wenlong Hou, Weidong Zhao, Xianhui Liu, WenYan Guo","doi":"10.1145/3659948","DOIUrl":"https://doi.org/10.1145/3659948","url":null,"PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"9 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140617963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To improve the scalability of resources and ensure the effective sharing and utilization of online English resources, an online English resource integration algorithm based on high-dimensional mixed-attribute data mining is proposed. First, an integration structure based on high-dimensional mixed-attribute data mining is constructed. Using this structure, the characteristics of online English resources are extracted, and historical data are mined in combination with the spatial distribution characteristics of the resources. In this way, a spatial mapping function of the features is established, and the optimal clustering center is designed according to the clustering and fusion structure of online English resources. The resources are then clustered and fused at this center. Based on the fusion results, a distribution structure model of online English resources is constructed, and the integration algorithm is optimized. The experimental results show that the integration optimization efficiency of the proposed algorithm is 89% and the packet loss rate is 0.19%. The algorithm integrates well and can integrate multi-channel online English resources in various forms.
{"title":"Online English Resource Integration Algorithm based on high-dimensional Mixed Attribute Data Mining","authors":"Zhiyu Zhou","doi":"10.1145/3657289","DOIUrl":"https://doi.org/10.1145/3657289","url":null,"PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"38 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140594568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recommender systems, providing reasonable explanations can enhance users’ comprehension of recommended results. Template-based explainable recommendation heavily relies on pre-defined templates, constraining the expressiveness of generated sentences and resulting in low-quality explanations. Recently, a novel approach was introduced, utilizing embedding representations of items and comments to address the issue of user IDs and item IDs not residing in the same semantic space as words, thus attributing linguistic meaning to IDs. However, these models often fail to fully exploit collaborative information within the data. In personalized recommendation and explanation processes, understanding the user’s emotional feedback and feature preferences is paramount. To address this, we propose a personalized explainable recommendation model based on self-attention collaboration. Initially, the model employs an attention network to amalgamate the user’s historical interaction feature preferences with their user ID information, while simultaneously integrating all feature information of the item with its item ID to enhance semantic ID representation. Subsequently, the model incorporates the user’s comment feature rhetoric and sentiment feedback to generate more personalized recommendation explanations utilizing a self-attention network. Experimental evaluations conducted on two datasets of varying scales demonstrate the superiority of our model over current state-of-the-art approaches, validating its effectiveness.
{"title":"Personalized Explainable Recommendations for Self-Attention Collaboration","authors":"Yongfu Zha, Xuanxuan Che, Lina Sun, Yumin Dong","doi":"10.1145/3657636","DOIUrl":"https://doi.org/10.1145/3657636","url":null,"PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"58 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140594623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Because of the complexity of Chinese and its differences from English, processing Chinese text digitally is difficult. Taking Chinese text in Open Relation Extraction (ORE) as the research object, this paper analyzes the complexity of Chinese text. A word-vector extraction system based on construction grammar theory and Deep Learning (DL) is constructed to achieve smooth extraction from Chinese text. The work mainly includes the following aspects. First, to study the application of DL to the complexity analysis of Chinese text based on construction grammar, the connotation of construction grammar and its role in Chinese text analysis are explored. Second, from the perspective of word-vector-based ORE in language analysis, an ORE model based on word vectors is implemented. Moreover, an extraction method based on the distance between word vectors is proposed. Test results show that the proposed algorithm achieves an F1 value of 67% on the public WEB-500 and NYT-500 datasets, which is superior to other similar text extraction algorithms. When recall exceeds 30%, the accuracy of the proposed method is higher than that of several other recent language analysis systems. This indicates that the proposed Chinese text extraction system, based on the DL algorithm and construction grammar theory, has advantages in complexity analysis and can provide a new research direction for Chinese text analysis.
{"title":"Complexity Analysis of Chinese Text Based on the Construction Grammar Theory and Deep Learning","authors":"Changlin Wu, Changan Wu","doi":"10.1145/3625390","DOIUrl":"https://doi.org/10.1145/3625390","url":null,"PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"121 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140594756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Binarization of Tamizhi (Tamil-Brahmi) inscription images is highly challenging because the images are captured from very old stone inscriptions in India dating from around the 3rd century BCE. The difficulty stems from the degradation of these inscriptions by environmental factors and human negligence over the ages. Although much work has been done on image binarization, very little research has addressed inscription images, and no work has been reported on binarizing inscriptions carved on an irregular medium. This paper reviews the performance of various binarization techniques on Tamizhi inscription images. Since no previous work exists, we apply existing binarization algorithms to Tamizhi inscription images and analyze their performance with proper reasoning. The findings of the analysis hold for all writing carved on an irregular background. We believe that this reasoning about the results will help future researchers adapt, combine, or devise new binarization techniques.
{"title":"Performance of Binarization Algorithms on Tamizhi Inscription Images: An Analysis","authors":"Monisha Munivel, V S Felix Enigo","doi":"10.1145/3656583","DOIUrl":"https://doi.org/10.1145/3656583","url":null,"PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"102 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140594746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
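The abstract above does not list which binarization algorithms were evaluated. As an example of the kind of global method such a comparison usually includes, the sketch below implements Otsu's threshold on a flat list of 8-bit grayscale values. The choice of Otsu, the function names, and the toy pixel values are assumptions for illustration.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]           # pixels at or below t form the background class
        if w_bg == 0:
            continue
        w_fg = total - w_bg       # remaining pixels form the foreground class
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 1 if it is brighter than the threshold, else 0."""
    return [1 if p > threshold else 0 for p in pixels]


# Toy image: three dark pixels and three bright ones.
gray = [10, 12, 11, 200, 210, 205]
t = otsu_threshold(gray)
mask = binarize(gray, t)
```

A single global threshold like this tends to struggle on unevenly lit or eroded stone, which is one reason locally adaptive methods are usually compared alongside it.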
Automatic text summarization (ATS) provides a summary of distinct categories of information using natural language processing (NLP). These techniques have seen limited application to low-resource languages such as Hindi. This study proposes a method for automatically generating summaries of Hindi documents using an extractive technique. The approach retrieves pertinent sentences from the source documents by employing multiple linguistic features and machine learning (ML) using maximum likelihood estimation (MLE) and maximum entropy (ME). We pre-processed the input documents, for example by removing Hindi stop words and stemming. We obtained 15 linguistic feature scores from each document to identify the phrases with high scores for summary generation. We performed experiments on BBC News articles, CNN News, DUC 2004, the Hindi Text Short Summarization Corpus, the Indian Language News Text Summarization Corpus, and Wikipedia articles. The Hindi Text Short Summarization Corpus and the Indian Language News Text Summarization Corpus are in Hindi, whereas the BBC News articles, CNN News, and DUC 2004 datasets were translated into Hindi using the Google, Microsoft Bing, and Systran translators. The summarization results were calculated and reported for Hindi as well as for English to compare performance on a low-resource and a high-resource language. The evaluation used multiple ROUGE metrics along with precision, recall, and F-measure, and the proposed method performs better across multiple ROUGE scores. We compared the proposed method with supervised and unsupervised machine learning methods, including support vector machine (SVM), Naive Bayes (NB), decision tree (DT), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and K-means clustering, and found that it outperforms them.
{"title":"Automatic Extractive Text Summarization using Multiple Linguistic Features","authors":"Pooja Gupta, Swati Nigam, Rajiv Singh","doi":"10.1145/3656471","DOIUrl":"https://doi.org/10.1145/3656471","url":null,"abstract":"<p>Automatic text summarization (ATS) provides a summary of distinct categories of information using natural language processing (NLP). Low-resource languages like Hindi have restricted applications of these techniques. This study proposes a method for automatically generating summaries of Hindi documents using extractive technique. The approach retrieves pertinent sentences from the source documents by employing multiple linguistic features and machine learning (ML) using maximum likelihood estimation (MLE) and maximum entropy (ME). We conducted pre-processing on the input documents, such as eliminating Hindi stop words and stemming. We have obtained 15 linguistic feature scores from each document to identify the phrases with high scores for summary generation. We have performed experiments over BBC News articles, CNN News, DUC 2004, Hindi Text Short Summarization Corpus, Indian Language News Text Summarization Corpus, and Wikipedia Articles for the proposed text summarizer. The Hindi Text Short Summarization Corpus and Indian Language News Text Summarization Corpus datasets are in Hindi, whereas BBC News articles, CNN News, and the DUC 2004 datasets have been translated into Hindi using Google, Microsoft Bing, and Systran translators for experiments. The summarization results have been calculated and shown for Hindi as well as for English to compare the performance of a low and rich-resource language. Multiple ROUGE metrics, along with precision, recall, and F-measure, have been used for the evaluation, which shows the better performance of the proposed method with multiple ROUGE scores. We compare the proposed method with the supervised and unsupervised machine learning methodologies, including support vector machine (SVM), Naive Bayes (NB), decision tree (DT), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and K-means clustering, and it was found that the proposed method outperforms these methods.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140594619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
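The summarization abstract reports ROUGE scores together with precision, recall, and F-measure. As a rough sketch of how those three quantities relate, the following computes a simplified ROUGE-1 (unigram overlap) between a candidate summary and a reference summary. It assumes plain whitespace tokenization and applies no stemming or stop-word removal, so it is not the authors' evaluation setup; the function name `rouge_1` is hypothetical.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, f_measure) based on clipped unigram overlap."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())

    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand_counts & ref_counts).values())

    # Guard against empty inputs to avoid division by zero.
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


p, r, f = rouge_1("the cat sat", "the cat sat on the mat")
```

Here precision measures how much of the candidate appears in the reference, recall measures how much of the reference the candidate covers, and the F-measure is their harmonic mean. Because tokens are split on whitespace, the same function can be applied to Devanagari text without modification.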