
ACM Transactions on Asian and Low-Resource Language Information Processing: Latest Publications

Crossing Linguistic Barriers: Authorship Attribution in Sinhala Texts
IF 2 | CAS Tier 4 | Q2 Computer Science | Pub Date: 2024-03-30 | DOI: 10.1145/3655620
Raheem Sarwar, Maneesha Perera, Pin Shen Teh, Raheel Nawaz, Muhammad Umair Hassan
Authorship attribution involves determining the original author of an anonymous text from a pool of potential authors. The authorship attribution task has applications in several domains, such as plagiarism detection, digital text forensics, and information retrieval. While these applications extend beyond any single language, existing research has predominantly centered on English, posing challenges for application in languages like Sinhala due to linguistic disparities and a lack of language processing tools. We present the first comprehensive study on cross-topic authorship attribution for Sinhala texts and propose a solution that can effectively perform the authorship attribution task even when the topics of the test and training samples differ. Our solution consists of three main parts: (i) extraction of topic-independent stylometric features, (ii) generation of a small candidate author set with the help of similarity search, and (iii) identification of the true author. Several experimental studies demonstrate that the proposed solution can effectively handle real-world scenarios involving a large number of candidate authors and a limited number of text samples per candidate author.
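The three-stage pipeline described above can be sketched roughly as follows. This is an illustrative sketch only: the toy feature set, the `shortlist` helper, and all names are hypothetical stand-ins, not the authors' implementation.

```python
import math

def stylometric_vector(text):
    """Toy topic-independent stylometric features: average word length,
    average sentence length, and punctuation rate."""
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sent_len = len(words) / len(sentences)
    punct_rate = sum(text.count(c) for c in ",;:") / len(words)
    return [avg_word_len, avg_sent_len, punct_rate]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def shortlist(anon_text, author_profiles, k=2):
    """Stage (ii): keep only the k authors whose stylometric profile vectors
    are most similar to the anonymous text; stage (iii) would then pick the
    true author from this reduced candidate set."""
    q = stylometric_vector(anon_text)
    ranked = sorted(author_profiles.items(),
                    key=lambda kv: cosine(q, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

Shrinking the candidate set first is what keeps the final identification step tractable when the author pool is large.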
Citations: 0
Learn More Manchu Words with A New Visual-Language Framework
IF 2 | CAS Tier 4 | Q2 Computer Science | Pub Date: 2024-03-28 | DOI: 10.1145/3652992
Zhiwei Wang, Siyang Lu, Xiang Wei, Run Su, Yingjun Qi, Wei Lu

The Manchu language, a minority language of China, is of significant historical and research value. An increasing number of Manchu documents are digitized into image format for better preservation and study. Recently, many researchers have focused on identifying Manchu words in digitized documents. Previous approaches recognize Manchu words based on visual cues alone. However, visual-based approaches have two obvious drawbacks: it is difficult to distinguish similar or distorted letters, and portions of letters obscured by breakage and stains are hard to identify. To cope with these two challenges, we propose the Visual-Language framework for Manchu word Recognition (VLMR), which fuses visual and semantic information to accurately recognize Manchu words. Whenever visual information is unavailable, the language model can automatically associate the semantics of words. The performance of our method is further enhanced by introducing a self-knowledge distillation network. In addition, we created a new handwritten Manchu word dataset, HMW, which contains 6,721 handwritten Manchu words. The approach is evaluated on WMW and HMW; experiments show that our proposed method achieves state-of-the-art performance on both datasets.
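The idea of letting a language model take over when visual evidence is unreliable can be illustrated with a toy decoder. This is a hypothetical sketch, not VLMR itself: `recognize_word`, the bigram back-off, and the confidence threshold are illustrative assumptions.

```python
def recognize_word(visual_probs, bigram_lm, alphabet, threshold=0.5):
    """Decode a word letter by letter. Trust the visual classifier when its
    top probability is confident; otherwise (e.g. an occluded or damaged
    letter) fall back on a bigram language model conditioned on the
    previously decoded letter."""
    decoded = []
    prev = "<s>"
    for probs in visual_probs:          # one dict of letter -> prob per position
        best, p = max(probs.items(), key=lambda kv: kv[1])
        if p < threshold:               # weak visual evidence: combine with LM
            scores = {c: probs.get(c, 1e-6) * bigram_lm.get((prev, c), 1e-6)
                      for c in alphabet}
            best = max(scores, key=scores.get)
        decoded.append(best)
        prev = best
    return "".join(decoded)
```

The multiplicative combination of visual and language scores is one simple way to "fuse visual and semantic information"; the actual model learns this fusion end to end.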

Citations: 0
Application of Hybrid Image Processing Based on Artificial Intelligence in Interactive English Teaching
IF 2 | CAS Tier 4 | Q2 Computer Science | Pub Date: 2024-03-28 | DOI: 10.1145/3626822
Dou Xin, Cuiping Shi

Primary school English teaching resources play an important role in primary school English teaching. The information age requires that primary school English teaching strengthen the use of multimedia resources and gradually diversify teaching content. Augmented reality is a kind of hybrid image processing technology, and one of the significant technologies expected to influence the development of basic education over the next five years. It can seamlessly place virtual objects in the real environment, making information easier to obtain and absorb. It can also help students participate in exploration, cultivate their creativity and imagination, strengthen cooperation between students and teachers, and create varied learning environments, giving it immeasurable prospects for development in the field of education. Primary school English teaching resources based on augmented reality create realistic learning situations, moving from two-dimensional planes to three-dimensional displays and enriching the presentation of primary school English teaching content. This can stimulate students' interest in learning English and promote the transformation of English teaching methods; it is a useful attempt in the field of education. This paper reports statistics on the test results of an experimental class and a control class. Most scores in the experimental class fell between 71 and 100 (27 students, 67.5%), while the score distribution of the control class was relatively balanced, with the largest group (10 students, 25%) scoring between 61 and 70. It can therefore be seen that hybrid image processing technology is important for interactive English teaching.
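The reported percentages are mutually consistent under a class size of 40; note that the class size is our inference from the counts and percentages (27/0.675 = 10/0.25 = 40), not a figure stated in the abstract.

```python
# Counts reported in the abstract; the class size of 40 is inferred
# from the percentages, not stated explicitly.
class_size = 40
experimental_71_100 = 27   # experimental class, scores 71-100
control_61_70 = 10         # control class, scores 61-70

assert experimental_71_100 * 100 / class_size == 67.5
assert control_61_70 * 100 / class_size == 25.0
```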

Citations: 0
Syntax-aware Offensive Content Detection in Low-resourced Code-mixed Languages with Continual Pre-training
IF 2 | CAS Tier 4 | Q2 Computer Science | Pub Date: 2024-03-26 | DOI: 10.1145/3653450
Necva Bölücü, Pelin Canbay

Social media is a widely used platform with a vast amount of user-generated content, allowing information about users' thoughts to be extracted from texts. Individuals freely express their thoughts on these platforms, often without constraints, even if the content is offensive or contains hate speech. The identification and removal of offensive content from social media are imperative to prevent individuals or groups from becoming targets of harmful language. Despite extensive research on offensive content detection, the challenge remains unsolved in code-mixed languages, which are characterised by issues such as imbalanced datasets and limited data sources. Most previous studies on detecting offensive content in these languages focus on creating datasets and applying deep neural networks, such as Recurrent Neural Networks (RNNs), or pre-trained language models (PLMs) such as BERT and its variants. Given the low-resource nature and imbalanced datasets inherent in these languages, this study examines the efficacy of a syntax-aware BERT model with continual pre-training for the accurate identification of offensive content, and proposes a framework called Cont-Syntax-BERT that combines continual learning with continual pre-training. Comprehensive experimental results demonstrate that the proposed Cont-Syntax-BERT framework outperforms state-of-the-art approaches. Notably, the framework addresses the challenges posed by code-mixed languages, as evidenced by its performance on the DravidianCodeMix [10,19] and HASOC 2109 [37] datasets. These results demonstrate the adaptability of the proposed framework in effectively addressing the challenges of code-mixed languages.
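Continual pre-training keeps optimizing the same masked-language-modelling objective on new in-domain (here, code-mixed) text. A minimal sketch of the standard BERT masking recipe follows; the 15% selection rate and 80/10/10 split are the usual BERT defaults, assumed here rather than taken from this paper.

```python
import random

def mlm_mask(tokens, vocab, mask_token="[MASK]", seed=0):
    """Standard BERT masked-language-modelling corruption: select 15% of
    positions; of those, 80% become [MASK], 10% become a random token, and
    10% are left unchanged. The model is trained to reconstruct the
    original token at every selected position."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    n_pick = max(1, round(0.15 * len(tokens)))
    for i in rng.sample(range(len(tokens)), n_pick):
        labels[i] = tokens[i]           # reconstruction target
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_token
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)
        # else: keep the original token, but still predict it
    return corrupted, labels
```

Running this corruption over a code-mixed corpus and continuing MLM training is what lets a general-purpose PLM adapt to the target domain before fine-tuning.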

Citations: 0
A Context-enhanced Adaptive Graph Network for Time-sensitive Question Answering
IF 2 | CAS Tier 4 | Q2 Computer Science | Pub Date: 2024-03-22 | DOI: 10.1145/3653674
Jitong Li, Shaojuan Wu, Xiaowang Zhang, Zhiyong Feng

Time-sensitive question answering aims to answer questions restricted to certain timestamps, based on a given long document that mixes abundant temporal events with explicit or implicit timestamps. While existing models make great progress on time-sensitive questions, their performance degrades dramatically when a long distance separates the correct answer from the timestamp mentioned in the question. In this paper, we propose a Context-enhanced Adaptive Graph network (CoAG) to capture long-distance dependencies between sentences within extracted question-related episodes. Specifically, we propose a time-aware episode extraction module that obtains question-related context based on timestamps in the question and the document. As the involvement of episodes confuses sentences with adjacent timestamps, an adaptive message-passing mechanism is designed to capture and transfer inter-sentence differences. In addition, we present a hybrid text encoder to highlight question-related context built on global information. Experimental results show that CoAG improves significantly over state-of-the-art models on five benchmarks. Moreover, our model has a noticeable advantage on long-distance time-sensitive questions, improving EM scores by 2.03% to 6.04% on TimeQA-Hard.
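The time-aware episode extraction step can be illustrated with a toy filter that keeps only sentences whose timestamps lie near the timestamp in the question. The `extract_episode` helper and its year-window heuristic are illustrative assumptions, not the paper's actual module.

```python
def extract_episode(sentences, question_year, window=2):
    """Given (sentence, year) pairs from a long document, keep only the
    sentences whose attached timestamp falls within `window` years of the
    timestamp mentioned in the question, so that the downstream reader
    sees question-relevant context instead of the whole document."""
    return [s for s, year in sentences
            if year is not None and abs(year - question_year) <= window]
```

A real system would also resolve implicit timestamps (e.g. "two years later") before filtering; that resolution step is omitted here.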

Citations: 0
Topic-Aware Masked Attentive Network for Information Cascade Prediction
IF 2 | CAS Tier 4 | Q2 Computer Science | Pub Date: 2024-03-21 | DOI: 10.1145/3653449
Yu Tai, Hongwei Yang, Hui He, Xinglong Wu, Yuanming Shao, Weizhe Zhang, Arun Kumar Sangaiah

Predicting information cascades has significant practical implications, including applications in public opinion analysis, rumor control, and product recommendation. Existing approaches have generally overlooked the significance of semantic topics in information cascades or disregarded dissemination relations. Such models are inadequate for capturing the intricate diffusion process within an information network inundated with diverse topics. To address these problems, we propose a neural model, the Topic-Aware Masked Attentive Network for Information Cascade Prediction (ICP-TMAN), to predict the next infected node of an information cascade. First, we encode topical text into the user representation to perceive user-topic dependency. Next, we employ a masked attentive network to devise the diffusion context and capture user-context dependency. Finally, we exploit a deep attention mechanism to model historically infected nodes for user-embedding enhancement, capturing user-history dependency. Extensive experiments on three real-world datasets demonstrate the superiority of ICP-TMAN over existing state-of-the-art approaches.
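The masked attentive network builds on standard scaled dot-product attention, where a mask sends the scores of disallowed positions (e.g. padding or future nodes) to negative infinity so they receive zero weight. A minimal NumPy sketch of that primitive, not of the full ICP-TMAN architecture:

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; positions where mask == 0 get a score
    of -inf and therefore zero attention weight after the softmax."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask.astype(bool), scores, -np.inf)
    # numerically stable softmax over the last axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Each row of the mask must keep at least one position unmasked, otherwise the softmax denominator is zero.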

Citations: 0
Cross-Domain Aspect-based Sentiment Classification with Pre-Training and Fine-Tuning Strategy for Low-Resource Domains
IF 2 | CAS Tier 4 | Q2 Computer Science | Pub Date: 2024-03-21 | DOI: 10.1145/3653299
Chunjun Zhao, Meiling Wu, Xinyi Yang, Xuzhuang Sun, Suge Wang, Deyu Li

Aspect-based sentiment classification (ABSC) is a crucial subtask of fine-grained sentiment analysis (SA), which aims to predict the sentiment polarity of given aspects in a sentence as positive, negative, or neutral. Most existing ABSC methods are based on supervised learning. However, these methods rely heavily on fine-grained labeled training data, which can be scarce in low-resource domains, limiting their effectiveness. To overcome this challenge, we propose a low-resource cross-domain aspect-based sentiment classification (CDABSC) approach based on a pre-training and fine-tuning strategy. The approach applies this strategy to an advanced deep learning method designed for ABSC, namely the attention-based encoding graph convolutional network (AEGCN) model. Specifically, a high-resource domain is selected as the source domain, and the AEGCN model is pre-trained using a large amount of fine-grained annotated data from the source domain; the optimal parameters of the model are preserved. Subsequently, a low-resource domain is used as the target domain, and the pre-trained model parameters serve as the initial parameters of the target-domain model, which is fine-tuned using a small amount of annotated data, improving the accuracy of sentiment classification in the low-resource domain. Finally, experimental validation on two domain benchmark datasets, restaurant and laptop, demonstrates that our approach significantly outperforms the baselines in Micro-F1.
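The pre-train-then-fine-tune transfer described above can be miniaturized with a linear classifier: train on plentiful source-domain data, then continue training the same parameters on a handful of target-domain examples. Everything here (the logistic regression, the synthetic domains) is an illustrative stand-in for the AEGCN model and the real restaurant/laptop datasets.

```python
import numpy as np

def train_logreg(X, y, w=None, lr=0.1, epochs=200):
    """Plain gradient-descent logistic regression. Passing `w` continues
    training from existing parameters (fine-tuning) instead of starting
    from scratch."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(0)
# Large labelled "source domain" vs. tiny labelled "target domain" with
# shifted inputs (a toy stand-in for domain shift).
Xs = rng.normal(size=(500, 2))
ys = (Xs[:, 0] + Xs[:, 1] > 0).astype(float)
Xt = rng.normal(size=(20, 2)) + 0.5
yt = (Xt[:, 0] + Xt[:, 1] > 0).astype(float)

w_src = train_logreg(Xs, ys)                             # "pre-training"
w_ft = train_logreg(Xt, yt, w=w_src.copy(), epochs=50)   # "fine-tuning"
```

The source-domain parameters give the target model a far better starting point than a cold start on 20 examples would.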

Citations: 0
Supervised Contrast Learning Text Classification Model Based on Data Quality Augmentation
IF 2 | CAS Tier 4 | Q2 Computer Science | Pub Date: 2024-03-19 | DOI: 10.1145/3653300
Liang Wu, Fangfang Zhang, Chao Cheng, Shinan Song

Token-level data augmentation generates text samples by modifying the words of sentences. However, data that are not easily classified can negatively affect the model. In particular, ignoring the role of keywords when performing random augmentation operations on samples may lead to low-quality supplementary samples. Therefore, we propose a supervised contrastive learning text classification model based on data quality augmentation (DQA). First, dynamic training is used to screen high-quality datasets containing information beneficial to model training. The selected data are then augmented based on important words carrying tag information. To obtain a better text representation for the downstream classification task, we employ a standard supervised contrastive loss to train the model. Finally, we conduct experiments on five text classification datasets to validate the effectiveness of our model. In addition, ablation experiments verify the impact of each module on classification.
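The standard supervised contrastive loss referenced above (Khosla et al., 2020) pulls same-label embeddings together and pushes different-label embeddings apart. This NumPy version is a reference sketch of that loss, not the authors' training code.

```python
import numpy as np

def supcon_loss(features, labels, tau=0.1):
    """Supervised contrastive loss: for each anchor i, average
    -log( exp(z_i.z_p / tau) / sum_{a != i} exp(z_i.z_a / tau) )
    over its positives p (same label, different index)."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / tau
    n = len(labels)
    total, anchors = 0.0, 0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue  # anchors with no positive pair contribute nothing
        denom = sum(np.exp(sim[i, j]) for j in range(n) if j != i)
        total -= sum(np.log(np.exp(sim[i, p]) / denom) for p in pos) / len(pos)
        anchors += 1
    return total / anchors
```

Well-separated classes yield a small loss; embeddings where classes overlap yield a large one, which is exactly the gradient signal that shapes the representation for the downstream classifier.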

Citations: 0
NPEL: Neural Paired Entity Linking in Web Tables NPEL:网络表格中的神经配对实体链接
IF 2 4区 计算机科学 Q2 Computer Science Pub Date : 2024-03-19 DOI: 10.1145/3652511
Tianxing Wu, Lin Li, Huan Gao, Guilin Qi, Yuxiang Wang, Yuehua Li

This paper studies entity linking (EL) in Web tables, which aims to link the string mentions in table cells to their referent entities in a knowledge base. Two main problems exist in previous studies: 1) contextual information is not well utilized in mention-entity similarity computation; 2) the entity-coherence assumption that all entities in the same row or column are highly related to each other is not always correct. In this paper, we propose NPEL, a new Neural Paired Entity Linking framework, to overcome the above problems. In NPEL, we design a deep learning model with different neural networks and an attention mechanism to model different kinds of contextual information of mentions and entities for mention-entity similarity computation in Web tables. NPEL also relaxes the above entity-coherence assumption with a new paired entity linking algorithm, which iteratively selects the two mentions with the highest confidence for EL. Experiments on real-world datasets show that NPEL achieves the best performance compared with state-of-the-art baselines across different evaluation metrics.
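The paired selection step can be illustrated with a small greedy loop. This is a toy reconstruction under assumed inputs (a confidence score per mention-entity pair), and it omits the re-scoring with neural context features that the full NPEL framework performs after each round:

```python
def iterative_paired_linking(scores):
    """Toy paired entity linking.

    scores maps each mention to a dict of candidate entities and their
    confidences. Each round links the two unresolved mentions whose best
    candidates score highest, mirroring the "select two mentions with the
    highest confidence" step.
    """
    linked = {}
    unresolved = set(scores)
    while unresolved:
        # rank unresolved mentions by their best candidate's confidence
        ranked = sorted(unresolved,
                        key=lambda m: max(scores[m].values()),
                        reverse=True)
        for m in ranked[:2]:  # link a pair (or the last single mention)
            linked[m] = max(scores[m], key=scores[m].get)
            unresolved.remove(m)
    return linked
```

In the real framework, the confidences of the remaining mentions would be recomputed after each round, using the newly linked entities as added context; here they are fixed for simplicity.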

Citations: 0
THAR- Targeted Hate Speech Against Religion: A high-quality Hindi-English code-mixed Dataset with the Application of Deep Learning Models for Automatic Detection THAR--有针对性的反宗教仇恨言论:应用深度学习模型进行自动检测的高质量印地语-英语混合代码数据集
IF 2 4区 计算机科学 Q2 Computer Science Pub Date : 2024-03-18 DOI: 10.1145/3653017
Deepawali Sharma, Aakash Singh, Vivek Kumar Singh

During the last decade, social media has gained significant popularity as a medium for individuals to express their views on various topics. However, some individuals also exploit social media platforms to spread hatred through their comments and posts, some of which target individuals, communities, or religions. Given the deep emotional connections people have to their religious beliefs, this form of hate speech can be divisive and harmful, and may result in mental-health issues as well as social disorder. There is therefore a need for algorithmic approaches to the automatic detection of instances of hate speech. Most existing studies in this area focus on social media content in English, and as a result several low-resource languages lack computational resources for the task. This study attempts to address this research gap by providing a high-quality annotated dataset designed specifically for identifying hate speech against religions in Hindi-English code-mixed text. The dataset, Targeted Hate Speech Against Religion (THAR), consists of 11,549 comments annotated by five independent annotators. It comprises two subtasks: (i) Subtask-1 (binary classification) and (ii) Subtask-2 (multi-class classification). To ensure annotation quality, the Fleiss Kappa measure has been employed. The suitability of the dataset is then further explored by applying different standard deep learning and transformer-based models. The transformer-based model, namely Multilingual Representations for Indian Languages (MuRIL), is found to outperform the other implemented models in both subtasks, achieving macro-average and weighted-average F1 scores of 0.78 and 0.78 on Subtask-1, and 0.65 and 0.72 on Subtask-2, respectively. The experimental results obtained not only confirm the suitability of the dataset but also advance research towards automatic detection of hate speech, particularly in the low-resource Hindi-English code-mixed language.
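The Fleiss Kappa measure used to check annotation quality has a closed form over a table of per-item category counts; the sketch below is a minimal illustration (the input layout is assumed: one row per comment, one column per label, with the same number of raters per item):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of per-item category counts.

    ratings: list of rows, one per item; row[j] is how many of the n
    raters assigned the item to category j (every row sums to the same n).
    """
    N = len(ratings)             # number of items
    n = sum(ratings[0])          # raters per item
    k = len(ratings[0])          # number of categories
    # proportion of all assignments falling in each category
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # per-item observed agreement
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N         # mean observed agreement
    P_e = sum(x * x for x in p)  # expected chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement gives kappa = 1, chance-level agreement gives values near 0, and systematic disagreement goes negative.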

Citations: 0