Keyphrases are topical phrases that represent a document and are used in various fields. Many methods have been proposed to extract keyphrases automatically, but most rely on candidate selection using linguistic heuristics designed for English. In this work on Thai keyphrase extraction, candidate selection based on Universal Dependencies (UD) is proposed instead of relying only on POS sequences, making this step language independent. To enhance candidate selection, tree-based keyphrase extraction is also adapted to keep only plausible candidates based on the cohesiveness index (CI). In addition, score filtering is proposed to combine linguistic heuristics such as stop words and phrase position. In the experiments, our method achieved double the averaged F1 score of the state-of-the-art method, even though the UD parser was trained on only 1,781 EDUs and achieved an 84% labeled attachment score. In addition, ablation studies on each factor in score filtering revealed which factors are important for keyphrase extraction.
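The core idea of UD-based candidate selection is that a keyphrase candidate is a subtree of the dependency parse rather than a flat POS pattern. The sketch below illustrates this with a toy CoNLL-U-style token list; the relation set, span-length limit, and token format are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: selecting keyphrase candidates from a dependency parse instead of
# a POS-sequence pattern. Token format and relation choices are hypothetical.

def collect_subtree(root, children):
    """Gather all token ids in the subtree rooted at `root`."""
    span, stack = set(), [root]
    while stack:
        node = stack.pop()
        span.add(node)
        stack.extend(children.get(node, []))
    return span

def extract_candidates(tokens):
    """tokens: list of dicts with 'id', 'form', 'head', 'deprel' (UD-style)."""
    children = {}
    for t in tokens:
        children.setdefault(t["head"], []).append(t["id"])
    by_id = {t["id"]: t for t in tokens}

    candidates = []
    # Treat each nominal-like head plus its modifier subtree as one candidate.
    for t in tokens:
        if t["deprel"] in {"root", "nsubj", "obj", "obl"}:
            span = collect_subtree(t["id"], children)
            if 1 < len(span) <= 4:  # keep short, phrase-like spans only
                candidates.append(" ".join(by_id[i]["form"] for i in sorted(span)))
    return candidates

# Toy parse of "automatic keyphrase extraction"
tokens = [
    {"id": 1, "form": "automatic",  "head": 3, "deprel": "amod"},
    {"id": 2, "form": "keyphrase",  "head": 3, "deprel": "compound"},
    {"id": 3, "form": "extraction", "head": 0, "deprel": "root"},
]
print(extract_candidates(tokens))  # ['automatic keyphrase extraction']
```

Because the selection is driven by dependency relations rather than language-specific POS templates, the same routine applies to any language with a UD parse.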
Title: Enhancing Thai Keyphrase Extraction Using Syntactic Relations: An Adoption of Universal Dependencies Framework
Authors: Chanatip Saetia, Tawunrat Chalothorn, Supawat Taerungruang
DOI: 10.1109/iSAI-NLP56921.2022.9960284
Published in: 2022 17th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 2022-11-05
Pub Date: 2022-11-05  DOI: 10.1109/iSAI-NLP56921.2022.9960280
W. Limprasert, P. Jantana, Avirut Liangsiri
Many critical tasks such as document approval and banking services, which are now hosted on cloud infrastructure. This transformation introduces stress on cloud security from the physical layer of the data center to the application layer of web application. All data access and service access need to be monitored and responded to in real-time. In this paper, we study methods to detect anomaly incidents such as spikes from network volume, malicious incidents from API scanning, error messages from internal systems and timeout from Slowloris attack[l]. We select machine learning based anomaly detection algorithms, such as LOF, Isolation Forest and Elliptic Envelope, to find suitable methods to detect incidents in real-time using stream processing tools including Kafka and message ingression. The result shows that LOF is fast and robust in most of the cases. However, when log messages have unseen words, which normally need to be hashed to preprocess, the Isolation Forest shows better results. This study shows the possibility of applying stream processing with machine learning to detect anomaly behavior for cloud services.
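The interaction between hashed features and the two detectors can be sketched with scikit-learn: a `HashingVectorizer` maps log messages (including previously unseen words) to a fixed-size feature space, and LOF and Isolation Forest each score new messages against a window of normal traffic. The toy logs, feature size, and neighbor count below are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of log anomaly scoring with hashed text features.
# Toy data; in the paper's setting the messages would arrive via a Kafka stream.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

normal_logs = (["GET /api/v1/users 200"] * 50
               + ["POST /api/v1/login 200"] * 50)
test_logs = ["GET /api/v1/users 200",          # seen pattern -> normal
             "DROP TABLE users; -- 500"]       # unseen words -> anomaly

# Hashing copes with unseen words: no vocabulary to go stale.
vec = HashingVectorizer(n_features=256, alternate_sign=False)
X_train = vec.transform(normal_logs)
X_test = vec.transform(test_logs)

iso = IsolationForest(random_state=0).fit(X_train)
lof = LocalOutlierFactor(novelty=True, n_neighbors=5).fit(X_train)

print(lof.predict(X_test))  # 1 = normal, -1 = anomaly
print(iso.predict(X_test))
```

In a streaming deployment, `fit` would be re-run periodically on a sliding window of recent traffic while `predict` scores each arriving message.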
Title: Anomaly Detection on Real-time Security Log using Stream Processing
Authors: W. Limprasert, P. Jantana, Avirut Liangsiri
DOI: 10.1109/iSAI-NLP56921.2022.9960280
Published in: 2022 17th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 2022-11-05
Pub Date: 2022-11-05  DOI: 10.1109/iSAI-NLP56921.2022.9960267
Zar Zar Hlaing, Ye Kyaw Thu, T. Supnithi, P. Netisopakul
Dependency parsing (DP) examines the relationships between words in a sentence to determine its grammatical structure. Based on these relationships, a sentence is broken down into several components. The process rests on the idea that every linguistic component of a sentence is directly related to the others; these relationships are called dependencies. Dependency parsing is a key step in natural language processing (NLP) for several text mining approaches. In recent years, Universal Dependencies (UD) has emerged as the dominant formalism for dependency parsing. Various UD corpora and dependency parsers are publicly accessible for resource-rich languages; however, no such resources are publicly available for the low-resource language Myanmar. Thus, we manually extended the existing small Myanmar UD corpus (the myPOS UD corpus) into the myPOS version 3.0 UD corpus and publish it as a publicly available resource. To evaluate the effect of the extended UD corpus versus the original, we used the graph-based neural dependency parsing models jPTDP (joint POS tagging and dependency parsing) and UniParse (universal graph-based parsing), measuring performance in terms of the unlabeled attachment score (UAS) and the labeled attachment score (LAS). We compared the accuracies of the graph-based neural models on the original and extended UD corpora. The experimental results showed that, compared to the original myPOS UD corpus, the extended myPOS version 3.0 UD corpus improved the accuracy of the dependency parsing models.
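The two evaluation metrics are simple to state precisely: UAS counts tokens whose predicted head is correct, and LAS additionally requires the correct dependency label. A minimal sketch, with made-up gold and predicted arcs rather than data from the myPOS corpus:

```python
# Sketch of UAS/LAS computation over aligned (head, label) arcs per token.
def attachment_scores(gold, pred):
    """gold, pred: lists of (head_index, deprel) pairs, aligned by token position."""
    assert len(gold) == len(pred) and gold
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)  # head only
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)        # head + label
    return uas, las

gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 0.75 0.5
```

Here three of four heads are correct (UAS 0.75), but only two arcs match on both head and label (LAS 0.5), which is why LAS can never exceed UAS.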
Title: Graph-based Dependency Parser Building for Myanmar Language
Authors: Zar Zar Hlaing, Ye Kyaw Thu, T. Supnithi, P. Netisopakul
DOI: 10.1109/iSAI-NLP56921.2022.9960267
Published in: 2022 17th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 2022-11-05
Pub Date: 2022-11-05  DOI: 10.1109/iSAI-NLP56921.2022.9960255
Pantid Chantangphol, Theerat Sakdejayont, Tawunrat Chalothorn
Despite reaching satisfactory verification performance, variation in utterance duration and phonemes, together with system robustness, remains a challenge in speaker verification tasks. To address this challenge, we propose RAS-E2E, a novel fully cross-lingual speaker verification system that discovers meaningful information in raw input waveforms of various durations, including short utterances, to determine whether an utterance matches the target speaker. It merges two powerful paradigms: SincNet and the RawNet training scheme with a Bi-RNN. Experiments on the VoxCeleb, Gowajee, and internal call-center datasets demonstrate that RAS-E2E achieves better performance than recent waveform-based verification systems.
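Whatever the front end (SincNet filters, RawNet-style training, Bi-RNN), a text-independent verification system ultimately reduces to scoring a test-utterance embedding against an enrolled speaker embedding and thresholding that score. The sketch below shows only that final decision step; the vectors and threshold are made up for illustration and are not from the paper.

```python
# Sketch of the final verification decision: cosine similarity between an
# enrolled speaker embedding and a test-utterance embedding, thresholded.
import numpy as np

def cosine_score(enrolled, test):
    return float(np.dot(enrolled, test)
                 / (np.linalg.norm(enrolled) * np.linalg.norm(test)))

def verify(enrolled, test, threshold=0.7):
    """Accept the claimed identity if similarity clears the threshold."""
    return cosine_score(enrolled, test) >= threshold

enrolled = np.array([0.9, 0.1, 0.2])   # hypothetical speaker embedding
same     = np.array([0.85, 0.15, 0.25])  # utterance from the same speaker
diff     = np.array([0.1, 0.9, 0.1])     # utterance from a different speaker

print(verify(enrolled, same), verify(enrolled, diff))  # True False
```

In practice the threshold is tuned on a development set, typically at the equal error rate (EER) operating point.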
Title: RAS-E2E: The SincNet end-to-end with RawNet loss for text-independent speaker verification
Authors: Pantid Chantangphol, Theerat Sakdejayont, Tawunrat Chalothorn
DOI: 10.1109/iSAI-NLP56921.2022.9960255
Published in: 2022 17th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 2022-11-05
Pub Date: 2022-11-05  DOI: 10.1109/iSAI-NLP56921.2022.9960265
Parun Ngamcharoen, Nuttapong Sanglerdsinlapachai, P. Vejjanugraha
Traditionally, the training phase of abstractive text summarization involves feeding two sets of integer sequences, the first representing the source text and the second representing the words of the reference summary, into the encoder and decoder parts of the model, respectively. However, with this method the model tends to perform poorly if the source text includes words that are irrelevant or insignificant to the key ideas. To address this issue, we propose a new keyword-based method for abstractive summarization that combines the information provided by the source text and its keywords to generate the summary. We use a bi-directional long short-term memory model for keyword labelling, taking words that overlap between the source text and the reference summary as ground truth. The results of our experiments on the ThaiSum dataset show that our proposed method outperforms the traditional encoder-decoder model by 0.0425 on ROUGE-1 F1, 0.0301 on ROUGE-2 F1, and 0.0140 on BERTScore F1.
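The ground-truth construction for the keyword labeller is a straightforward per-token overlap check, as sketched below. English tokens stand in for Thai word-segmented input, and the examples are illustrative rather than taken from ThaiSum.

```python
# Sketch of ground-truth keyword labelling: a source token gets label 1 if it
# also appears in the reference summary, else 0. These labels then supervise
# the Bi-LSTM keyword labeller described above.
def label_keywords(source_tokens, summary_tokens):
    summary_set = set(summary_tokens)
    return [1 if tok in summary_set else 0 for tok in source_tokens]

source = ["the", "model", "improves", "rouge", "scores", "significantly"]
summary = ["model", "improves", "rouge"]
print(label_keywords(source, summary))  # [0, 1, 1, 1, 0, 0]
```

At training time, the predicted keyword labels give the encoder an extra signal about which source tokens are likely to matter, so irrelevant words carry less weight during generation.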
Title: Automatic Thai Text Summarization Using Keyword-Based Abstractive Method
Authors: Parun Ngamcharoen, Nuttapong Sanglerdsinlapachai, P. Vejjanugraha
DOI: 10.1109/iSAI-NLP56921.2022.9960265
Published in: 2022 17th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 2022-11-05