The meaning of the same word or sentence is likely to change across semantic contexts, which makes it difficult for a general-purpose translation system to maintain stable performance across domains. Domain adaptation is therefore an essential research topic in Neural Machine Translation practice. To efficiently train translation models for different domains, in this work we take a general Tibetan-Chinese translation model as the parent model and obtain two domain-specific Tibetan-Chinese translation models using small-scale in-domain data. The empirical results indicate that this method offers a practical approach to domain adaptation in low-resource scenarios, yielding higher BLEU scores and faster training than our general baseline models.
{"title":"Domain Adaptation for Tibetan-Chinese Neural Machine Translation","authors":"Maoxian Zhou, Jia Secha, Rangjia Cai","doi":"10.1145/3446132.3446404","DOIUrl":"https://doi.org/10.1145/3446132.3446404","url":null,"abstract":"The meaning of the same word or sentence is likely to change in different semantic contexts, which challenges general-purpose translation system to maintain stable performance across different domains. Therefore, domain adaptation is an essential researching topic in Neural Machine Translation practice. In order to efficiently train translation models for different domains, in this work we take the Tibetan-Chinese general translation model as the parent model, and obtain two domain-specific Tibetan-Chinese translation models with small-scale in-domain data. The empirical results indicate that the method provides a positive approach for domain adaptation in low-resource scenarios, resulting in better bleu metrics as well as faster training speed over our general baseline models.","PeriodicalId":125388,"journal":{"name":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132668561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
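The parent-child transfer described in this abstract can be sketched as a simple fine-tuning schedule: initialize the child model from the parent's weights, then continue training on the small in-domain set with a reduced learning rate. The sketch below uses a toy linear model in NumPy purely to illustrate that schedule; all names and hyperparameters are illustrative, not from the paper.

```python
import numpy as np

def train(w, X, y, lr, steps):
    """Plain gradient descent on squared error; stands in for NMT training."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])

# Large general-domain corpus -> parent model.
X_gen = rng.normal(size=(1000, 3))
y_gen = X_gen @ w_true
w_parent = train(np.zeros(3), X_gen, y_gen, lr=0.1, steps=200)

# Small in-domain corpus with a simulated domain shift.
X_dom = rng.normal(size=(50, 3))
y_dom = X_dom @ (w_true + np.array([0.3, 0.0, -0.3]))

# Child model: start from the parent's weights, fine-tune with a smaller lr.
w_child = train(w_parent.copy(), X_dom, y_dom, lr=0.02, steps=100)

err_parent = float(np.mean((X_dom @ w_parent - y_dom) ** 2))
err_child = float(np.mean((X_dom @ w_child - y_dom) ** 2))
```

The point of the sketch is only the initialization: the child starts where the parent ended, so far fewer in-domain steps are needed than training from scratch.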
Underwater optical images are scarce and exhibit varying degrees of blur and color distortion, which poses great challenges for underwater object detection. To address the shortcomings of the original Single Shot MultiBox Detector (SSD), this paper adds a shallow object detection layer to the original SSD model to improve the network's ability to detect small objects. It also modifies the confidence loss to narrow the gap in SSD's ability to detect different types of objects. The Multi-Scale Retinex with Color Restoration (MSRCR) algorithm is applied to the original images to enhance the feature information of objects in underwater images, and the improved SSD network is trained through transfer learning to overcome the limited supply of underwater images. Experimental results show that the proposed algorithm outperforms the original SSD, YOLO v3, and other algorithms in detection performance, which is of great significance for practical underwater object detection.
{"title":"Underwater Object Detection Based on Improved Single Shot MultiBox Detector","authors":"Zhongyun Jiang, Rong-Sheng Wang","doi":"10.1145/3446132.3446170","DOIUrl":"https://doi.org/10.1145/3446132.3446170","url":null,"abstract":"Underwater optical images are scarce, and there are varying degrees of blur and color distortion, which brings great challenges to the detection of underwater objects. In view of the shortcomings of the original Single Shot MultiBox Detector (SSD), in this paper, a shallow object detection layer is added to the original SSD model to improve the network's ability to detect small objects. At the same time, this article improves the confidence loss to narrow the ability of SSD to detect different types of objects. Using the Multi-Scale Retinex with Color Restoration (MSRCR) algorithm to process the original images, enhance the feature information of the objects in the underwater images. Training the improved SSD network through transfer learning to overcome the limitations of insufficient underwater images. Experimental results show that the algorithm proposed in this paper has better detection performance than the original SSD, YOLO v3 and other algorithms, which is of great significance to the realization of underwater object detection.","PeriodicalId":125388,"journal":{"name":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120960456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
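The MSRCR preprocessing step mentioned above follows a standard formulation (Jobson et al.): a multi-scale retinex term averages the log-ratio of the image to Gaussian-blurred versions of itself, and a color restoration factor re-weights channels by their share of the local color sum. The NumPy sketch below is a minimal, unoptimized version of that formulation; the blur radii and the alpha/beta constants are common defaults, not values from the paper.

```python
import numpy as np

def gauss_blur(img, sigma):
    """Separable Gaussian blur; radius is clipped so the kernel fits the image."""
    r = max(1, min(int(3 * sigma), min(img.shape[0], img.shape[1]) // 2 - 1))
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()
    out = img.astype(float)
    for axis in (0, 1):
        out = np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), axis, out)
    return out

def msrcr(img, sigmas=(5, 15, 30), alpha=125.0, beta=46.0):
    I = img.astype(float) + 1.0  # avoid log(0)
    # Multi-scale retinex: average log-ratio of image to its blurred versions.
    msr = np.mean([np.log(I) - np.log(gauss_blur(I, s) + 1.0) for s in sigmas], axis=0)
    # Color restoration: boost channels that dominate the local color sum.
    crf = beta * (np.log(alpha * I) - np.log(I.sum(axis=2, keepdims=True)))
    out = msr * crf
    out = (out - out.min()) / (out.max() - out.min() + 1e-9)  # stretch to [0, 1]
    return (out * 255).astype(np.uint8)

enhanced = msrcr(np.random.default_rng(0).integers(0, 256, size=(48, 48, 3)).astype(np.uint8))
```

In the paper's pipeline, the output of this enhancement feeds the improved SSD as training input.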
Automatic keyword extraction refers to extracting words or phrases from a single text or a text collection. Supervised methods outperform unsupervised methods but require a large labeled corpus for training. To address this problem, extra knowledge is obtained through labels generated by other tools. Moreover, preprocessing Chinese text is more challenging than English because of the fragments produced by word segmentation, so named entity recognition is introduced into preprocessing to improve accuracy. In addition, a text contains several distinct parts, each conveying information to readers at a different level; we therefore present a priority-based text weighting method that accounts for the importance of different text parts. In this paper, we integrate the three ideas above and propose a novel hybrid method for Chinese keyword extraction (HSCKE). To evaluate its performance, we compare HSCKE with four commonly used methods on two typical Chinese keyword extraction datasets. The experimental results show that the proposed approach achieves the best performance in terms of precision, recall, and F1 score.
{"title":"HSCKE: A Hybrid Supervised Method for Chinese Keywords Extraction","authors":"Shuyu Kong, Ping Zhu, Qian Yang, Zhihua Wei","doi":"10.1145/3446132.3446408","DOIUrl":"https://doi.org/10.1145/3446132.3446408","url":null,"abstract":"Automatic keywords extraction refers to extracting words or phrases from a single text or text collection. Supervised methods outperform unsupervised methods, but it requires a large volume of labeled corpus for training. To address the problem, extra knowledge is obtained through labels generated by other tools. Moreover, the preprocessing of Chinese text is more challenging than that in English because of the fragments caused by word segment. Hence the named entity recognition in the preprocessing is introduced to enhance the accuracy. On the other hand, text contains different separate parts, and each part conveys information to readers on different levels. Thus, we present a text weighting method based on priority that takes into consideration the importance of different texture parts. In this paper, we integrate the three ideas above and propose a novel hybrid method for Chinese keywords extraction (HSCKE). To evaluate the performance of our proposed approach, we compare HSCKE with four most commonly used methods on two typical Chinese keywords extraction datasets. 
The experimental results show that the proposed approach achieves the optimal performance in terms of precision, recall and F1 score.","PeriodicalId":125388,"journal":{"name":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129533668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
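The priority-based text weighting idea in this abstract can be made concrete with a tiny sketch: term frequencies are accumulated with a per-part weight, so a candidate appearing in the title counts more than one appearing in the body. The part names and weights below are hypothetical, chosen only to illustrate the mechanism.

```python
from collections import Counter

# Hypothetical priority weights for different parts of a document.
PART_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}

def score_keywords(parts, top_k=3):
    """parts: {part_name: [tokens]} -> top_k (token, score) pairs,
    weighting term frequency by the priority of the part it occurs in."""
    scores = Counter()
    for part, tokens in parts.items():
        w = PART_WEIGHTS.get(part, 1.0)
        for tok in tokens:
            scores[tok] += w
    return scores.most_common(top_k)

doc = {
    "title": ["keyword", "extraction"],
    "abstract": ["supervised", "keyword", "methods"],
    "body": ["methods", "corpus", "corpus", "labels"],
}
top = score_keywords(doc)  # "keyword" wins: title hit (3.0) + abstract hit (2.0)
```

A full system would combine such scores with the supervised classifier and NER signals the abstract describes; this sketch covers only the weighting component.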
Deep learning is widely used in biometrics, but a large amount of labeled image data is required to obtain a well-performing, complicated model. Finger vein recognition has great advantages over common biometric methods in terms of security and privacy; however, very few finger vein datasets exist. To solve this problem, this paper proposes a GAN-based finger vein dataset generation method, the first attempt at generating a finger vein dataset with a GAN. We generate a total of 53,630 finger vein images of 5,363 different subjects and validate the synthetic dataset, which provides a basis for applying complex deep neural networks to finger vein recognition.
{"title":"A GAN-based Method for Generating Finger Vein Dataset","authors":"Hanwen Yang, P. Fang, Zhiang Hao","doi":"10.1145/3446132.3446150","DOIUrl":"https://doi.org/10.1145/3446132.3446150","url":null,"abstract":"Deep learning is widely used in the field of biometrics, but a large amount of labeled image data is required to obtain a well-performing complicated model. Finger vein recognition has huge advantages over common biometric methods in terms of security and privacy. However, there are very few finger vein-related datasets. In order to solve this problem, this paper proposes a GAN-based finger vein dataset generation method, which is the first attempt in the domain of finger vein dataset generation by GAN. This paper generates a total of 53,630 images of 5,363 different subjects of finger veins and validates the synthetic dataset, which provides the basis for applying complex deep neural networks in the field of finger vein recognition.","PeriodicalId":125388,"journal":{"name":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131252060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
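The adversarial training loop behind such dataset generation can be sketched in a few lines. The toy below replaces vein images with 2-D points and uses a linear generator and a logistic-regression discriminator, so it shows only the alternating-update structure of a GAN, not anything resembling the paper's image-generating architecture; every name and constant is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_noise, d_data = 256, 2, 2

# Stand-in for real finger vein images: a 2-D Gaussian blob.
real = rng.normal(loc=[2.0, -1.0], scale=0.3, size=(n, d_data))

G = rng.normal(scale=0.1, size=(d_noise, d_data))  # generator weights
g_b = np.zeros(d_data)                             # generator bias
w = rng.normal(scale=0.1, size=d_data)             # discriminator weights
b = 0.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30, 30)))

lr = 0.05
for _ in range(500):
    z = rng.normal(size=(n, d_noise))
    fake = z @ G + g_b
    # Discriminator step: logistic regression, real -> 1, fake -> 0.
    for X, y in ((real, 1.0), (fake, 0.0)):
        p = sigmoid(X @ w + b)
        g = (p - y) / n
        w -= lr * (X.T @ g)
        b -= lr * g.sum()
    # Generator step: move fake samples toward the discriminator's "real" side.
    p = sigmoid(fake @ w + b)
    dfake = ((p - 1.0) / n)[:, None] * w[None, :]
    G -= lr * (z.T @ dfake)
    g_b -= lr * dfake.sum(axis=0)

# After training, the generator can emit an arbitrary number of synthetic samples.
synthetic = rng.normal(size=(1000, d_noise)) @ G + g_b
```

The ability to sample indefinitely from the trained generator is what lets the paper expand 5,363 subjects into 53,630 images.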
Garbage recycling is becoming an urgent need, as the rapid development of human society produces colossal amounts of waste every year. However, current machine learning models for intelligent garbage detection and classification are highly constrained by limited processing speed and large model size, which makes them difficult to deploy on portable, real-time, energy-efficient edge-computing devices. In this paper, we therefore introduce a novel YOLO-based neural network with a Variational Autoencoder (VAE) to increase the accuracy of automatic garbage recycling, accelerate computation, and reduce model size so that it is feasible in real-world garbage recycling scenarios. The model consists of a convolutional feature extractor, a convolutional predictor, and a decoder. After training, the model achieves an accuracy of 69.70% with 32.1 million parameters at a processing speed of 60 Frames Per Second (FPS), surpassing existing models such as YOLO v1 and Fast R-CNN.
{"title":"A YOLO-based Neural Network with VAE for Intelligent Garbage Detection and Classification","authors":"Anbang Ye, Bo Pang, Yucheng Jin, Jiahuan Cui","doi":"10.1145/3446132.3446400","DOIUrl":"https://doi.org/10.1145/3446132.3446400","url":null,"abstract":"Garbage recycling is becoming an urgent need for the people as the rapid development of human society is producing colossal amount of waste every year. However, current machine learning models for intelligent garbage detection and classification are highly constrained by their limited processing speeds and large model sizes, which make them difficult to be deployed on portable, real-time, and energy-efficient edge-computing devices. Therefore, in this paper, we introduce a novel YOLO-based neural network model with Variational Autoencoder (VAE) to increase the accuracy of automatic garbage recycling, accelerate the speed of calculation, and reduce the model size to make it feasible in the real-world garbage recycling scenario. The model is consisted of a convolutional feature extractor, a convolutional predictor, and a decoder. 
After the training process, this model achieves a correct rate of 69.70% with a total number of 32.1 million parameters and a speed of processing 60 Frames Per Second (FPS), surpassing the performance of other existing models such as YOLO v1 and Fast R-CNN.","PeriodicalId":125388,"journal":{"name":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115716623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical clustering algorithms based on node similarity have been widely used in community detection, but they are not suitable for signed networks. Typical signed-network community detection algorithms also suffer from low-quality community division when started from different nodes. Based on node similarity, this paper proposes the CDNS algorithm (Community Detection algorithm based on Node Similarity in signed networks). First, the algorithm introduces a node influence measure suitable for signed networks as the basis for selecting each community's initial node. Second, it computes node similarity based on eigenvector centrality and selects, from the initial node's neighbours, the node with the highest similarity to form the initial community. Finally, according to the community contribution of neighbour nodes, the algorithm determines whether and in which order neighbour nodes join the community. Experiments on real and simulated signed networks show that the CDNS algorithm achieves good accuracy and efficiency.
{"title":"Community Detection Algorithm based on Node Similarity in Signed Networks","authors":"Zhi Bie, Lufeng Qian, J. Ren","doi":"10.1145/3446132.3446184","DOIUrl":"https://doi.org/10.1145/3446132.3446184","url":null,"abstract":"Hierarchical clustering algorithms based on node similarity have been widely used in community detection, but it is not suitable for signed networks. The typical signed network community detection algorithm has the problem of low community division rate from different nodes. Based on the similarity of nodes, this paper proposes the CDNS algorithm (Community Detection Algorithm based on Node Similarity in Signed Networks). Firstly, the algorithm proposes a node influence measure suitable for signed networks as the basis for selecting the initial node of the community. Secondly, it proposes the calculation of the node similarity based on the eigenvector centrality, and selects the node with the highest similarity from the initial node from the neighbour nodes to form the initial community. Finally, according to the community contribution of neighbour nodes, algorithm determines whether the neighbour nodes are joined in the community and in which order the neighbour nodes are joined in the community. 
The experiments of real signed network and simulated signed network prove that the CDNS algorithm has good accuracy and efficiency.","PeriodicalId":125388,"journal":{"name":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116099579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
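The first two stages of such an algorithm (centrality-based seed selection, then similarity-based community seeding) can be sketched on a toy signed network. The similarity measure below (centrality-weighted count of same-signed common neighbours) is an illustrative stand-in, not the paper's formula.

```python
import numpy as np

def eigenvector_centrality(A, iters=100):
    """Power iteration on the absolute adjacency matrix; signs are stripped
    because eigenvector centrality needs a non-negative matrix."""
    M = np.abs(A)
    x = np.ones(len(M))
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)
    return x

def similarity(A, cent, i, j):
    """Toy node similarity: centrality mass of neighbours that i and j
    link to with the same sign (hypothetical, for illustration only)."""
    return sum(cent[k] for k in range(len(A)) if A[i, k] * A[j, k] > 0)

# Signed toy network: +1 friendly, -1 hostile, 0 no link.
A = np.array([
    [ 0,  1,  1, -1],
    [ 1,  0,  1, -1],
    [ 1,  1,  0, -1],
    [-1, -1, -1,  0],
], dtype=float)

cent = eigenvector_centrality(A)
seed = int(np.argmax(cent))  # most influential node starts the community
best = max((j for j in range(len(A)) if j != seed),
           key=lambda j: similarity(A, cent, seed, j))
community = {seed, best}     # node 3, hostile to everyone, stays out
```

The paper's third stage would then grow this seed community by the community-contribution criterion.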
Text classification is one of the most fundamental and important tasks in natural language processing, aiming to identify the most relevant label for a given piece of text. Although deep learning-based text classification methods have achieved promising results, most research focuses mainly on the internal context of the document, ignoring available global information such as document hierarchy and label semantics. To address this problem, we propose a novel Label-Attentive Hierarchical Network (LAHN) for document classification. In particular, we integrate label information into the hierarchical structure of the document by calculating word-label attention at the word level and sentence-label attention at the sentence level. We give full consideration to this global information when encoding the whole document, which makes the final document representation vector more discriminative for classification. Extensive experiments on several benchmark datasets show that our proposed LAHN surpasses several state-of-the-art methods.
{"title":"Label-Attentive Hierarchical Network for Document Classification","authors":"Xi Chen, Chongwu Dong, Jinghui Qin, Long Yin, Wushao Wen","doi":"10.1145/3446132.3446163","DOIUrl":"https://doi.org/10.1145/3446132.3446163","url":null,"abstract":"Text classification is one of the most fundamental and important tasks in the field of natural language processing, which aims to identify the most relevant label for a given piece of text. Although deep learning-based text classification methods have achieved promising results, most researches mainly focus on the internal context information of the document, ignoring the available global information such as document hierarchy and label semantics. To address this problem, we propose a novel Label-Attentive Hierarchical Network (LAHN) for document classification. In particular, we integrate label information into the hierarchical structure of the document by calculating the word-label attention at word level and the sentence-label attention at sentence level respectively. We give full consideration to the global information during encoding the whole document, which makes the final document representation vector more discriminative for classification. 
Extensive experiments on several benchmark datasets show that our proposed LAHN surpasses several state-of-the-art methods.","PeriodicalId":125388,"journal":{"name":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116511528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
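The word-label attention mentioned in the abstract reduces, in its simplest form, to a softmax over word-label dot products followed by a weighted mix of label embeddings. The NumPy sketch below shows that one level (words against labels); the sentence-level attention would repeat the same pattern over sentence vectors. Dimensions and random vectors are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_words, n_labels, d = 6, 4, 8
W = rng.normal(size=(n_words, d))   # word representations of one sentence
L = rng.normal(size=(n_labels, d))  # label embeddings

# Word-label attention: how strongly each word attends to each label.
att = softmax(W @ L.T, axis=1)      # (n_words, n_labels), rows sum to 1
# Label-aware word context: attention-weighted mix of label embeddings.
word_ctx = att @ L                  # (n_words, d)
sentence_vec = word_ctx.mean(axis=0)
```

Injecting `L` into the representation this way is what makes the final document vector label-sensitive rather than label-agnostic.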
Uncertainty is an inherent feature of a Knowledge Graph (KG) and is often modelled as confidence scores on relation facts. Although Knowledge Graph Embedding (KGE) has recently been a great success, predicting the confidence of unseen facts in a continuous vector space remains a big challenge. There are several reasons for this. First, current KGE is mostly concerned with deterministic knowledge, in which the confidence of unseen facts is treated as zero and that of observed facts as one. Second, uncertainty features are not well preserved in the embedding space. Third, approximate reasoning in embedding spaces is often unexplainable and unintuitive. Furthermore, the time and space cost of building embedding spaces that preserve uncertainty is very high. To address these issues, we consider Uncertain Knowledge Graphs (UKG) and propose a fast and effective embedding method, UKGsE, in which approximate reasoning and calculation can be performed quickly after generating an Uncertain Knowledge Graph Embedding (UKGE) space at high speed with reasonable accuracy. The idea is that treating relation facts as short sentences and preprocessing them benefits the learning and training of their confidence scores. Experiments show that the method is suitable for the downstream task of predicting the confidence of relation facts, whether or not they are seen in the UKG, and achieves the best trade-off between efficiency and accuracy. Furthermore, the model outperforms state-of-the-art uncertain link prediction baselines on the CN15k dataset.
{"title":"Fast Confidence Prediction of Uncertainty based on Knowledge Graph Embedding","authors":"Shihan Yang, Weiya Zhang, R. Tang","doi":"10.1145/3446132.3446186","DOIUrl":"https://doi.org/10.1145/3446132.3446186","url":null,"abstract":"The uncertainty is an inherent feature of Knowledge Graph (KG), which is often modelled as confidence scores of relation facts. Although Knowledge Graph Embedding (KGE) has been a great success recently, it is still a big challenge to predict confidence of unseen facts in KG in the continuous vector space. There are several reasons for this situation. First, the current KGE is often concerned with the deterministic knowledge, in which unseen facts’ confidence are treated as zero, otherwise as one. Second, in the embedding space, uncertainty features are not well preserved. Third, approximate reasoning in embedding spaces is often unexplainable and not intuitive. Furthermore, the time and space cost of obtaining embedding spaces with uncertainty preserved are always very high. To address these issues, considering Uncertain Knowledge Graph (UKG), we propose a fast and effective embedding method, UKGsE, in which approximate reasoning and calculation can be quickly performed after generating an Uncertain Knowledge Graph Embedding (UKGE) space in a high speed and reasonable accuracy. The idea is that treating relation facts as short sentences and pre-handling are benefit to the learning and training confidence scores of them. The experiment shows that the method is suitable for the downstream task, confidence prediction of relation facts, whether they are seen in UKG or not. It achieves the best tradeoff between efficiency and accuracy of predicting uncertain confidence of knowledge. 
Further, we found that the model outperforms state-of-the-art uncertain link prediction baselines on CN15k dataset.","PeriodicalId":125388,"journal":{"name":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122930858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
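The core task (regressing soft confidence labels instead of hard 0/1 truth values) can be illustrated with a minimal embedding model: a DistMult-style score squashed through a sigmoid and fitted to the facts' confidence scores by gradient descent. This is a generic uncertain-KGE sketch, not the UKGsE architecture; the facts, dimensions, and learning rate are all made up.

```python
import numpy as np

rng = np.random.default_rng(1)
entities = {"paris": 0, "france": 1, "berlin": 2}
relations = {"capital_of": 0}
E = rng.normal(scale=0.5, size=(len(entities), 4))   # entity embeddings
R = rng.normal(scale=0.5, size=(len(relations), 4))  # relation embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def confidence(h, r, t):
    """DistMult-style score squashed to [0, 1], read as a confidence."""
    return float(sigmoid((E[entities[h]] * E[entities[t]]) @ R[relations[r]]))

# Training facts carry soft confidence labels, not hard truth values.
facts = [("paris", "capital_of", "france", 0.95),
         ("berlin", "capital_of", "france", 0.05)]

lr = 0.2
for _ in range(2000):
    for h, r, t, c in facts:
        hi, ri, ti = entities[h], relations[r], entities[t]
        p = sigmoid((E[hi] * E[ti]) @ R[ri])
        g = 2.0 * (p - c) * p * (1.0 - p)  # d(squared error)/d(score)
        gh, gt, gr = g * E[ti] * R[ri], g * E[hi] * R[ri], g * E[hi] * E[ti]
        E[hi] -= lr * gh
        E[ti] -= lr * gt
        R[ri] -= lr * gr

p_true = confidence("paris", "capital_of", "france")
p_false = confidence("berlin", "capital_of", "france")
```

After training, the same scoring function can be queried for unseen triples, which is the confidence-prediction setting the abstract targets.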
The traditional video-based drill pipe counting method has low accuracy and is vulnerable to interference when positioning and tracking targets. To address this problem, a drill pipe counting method based on scale space and a Siamese network is proposed. The shape features of the drilling machine video image are computed with an improved scale-space algorithm, and the initial position of the drill chuck is determined by feature matching. The chuck is then tracked in real time with an improved Siamese network algorithm and its movement trajectory is recorded. Finally, the number of drill pipes is calculated by applying counting rules after locally weighted regression and hierarchical classification of the chuck's trajectory. Test results show that the improved method can stably track the target under the interference of bright light and count drill pipes accurately.
{"title":"Drill Pipe Counting Method Based on Scale Space and Siamese Network","authors":"Lihong Dong, Xinyi Wu, Jiehui Zhang","doi":"10.1145/3446132.3446179","DOIUrl":"https://doi.org/10.1145/3446132.3446179","url":null,"abstract":"Aiming at the problem that the traditional video-based drilling pipe counting method has low accuracy and is vulnerable to interference in the process of positioning and tracking targets, a drilling pipe counting method based on scale space and Siamese network was proposed: the shape features of the drilling machine video image were calculated by the improved scale space algorithm, the initial position of the drilling machine chuck was determined by feature matching, the chuck was tracked in real time according to the improved Siamese network algorithm and its movement trajectory was recorded, moreover, the number of drilling pipes was calculated after locally weighted regression and hierarchical classification of the chuck movement trajectory using counting rules. The test results showed that the improved method could stably track the target under the interference of bright light and realize the accurate counting of drilling pipe.","PeriodicalId":125388,"journal":{"name":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124819180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
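The final counting rule amounts to turning the chuck's smoothed position trace into a stroke count. The sketch below uses a moving average as a cheap stand-in for locally weighted regression and counts high-to-low excursions with hysteresis; the thresholds and the synthetic trajectory are invented for illustration.

```python
import numpy as np

def count_pipes(y, win=5, high=0.7, low=0.3):
    """Count chuck strokes in a vertical-position trace: smooth the track
    (moving average stands in for locally weighted regression), normalise,
    then count top-to-bottom excursions with hysteresis. One stroke == one pipe."""
    kernel = np.ones(win) / win
    s = np.convolve(y, kernel, mode="same")
    s = (s - s.min()) / (s.max() - s.min() + 1e-9)  # normalise to [0, 1]
    count, armed = 0, False
    for v in s:
        if v > high:
            armed = True          # chuck reached the top of its travel
        elif v < low and armed:
            count += 1            # completed a downward stroke
            armed = False
    return count

# Synthetic trajectory: 3 chuck strokes plus measurement noise.
t = np.linspace(0, 3 * 2 * np.pi, 600)
rng = np.random.default_rng(0)
y = np.sin(t) + rng.normal(scale=0.05, size=t.size)
n_pipes = count_pipes(y)  # hysteresis keeps noise from double-counting
```

The two thresholds give the counter noise immunity near either extreme, which is the same robustness goal the paper's hierarchical classification of the trajectory serves.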
Converting source code to feature vectors can be useful in programming-related tasks such as plagiarism detection in ACM contests. We present a new method for feature extraction from C++ files that includes both features describing syntactic and lexical properties of the AST and features characterizing the disassembly of the source code. We formulate plagiarism detection as a classification problem and demonstrate the effectiveness of our feature set on a dataset containing 50 ACM problems and ∼90k solutions. A trained XGBoost model achieves a binary F1-score of 0.745 on the test set.
{"title":"A Machine Learning Based Plagiarism Detection in Source Code","authors":"N. Viuginov, P. Grachev, A. Filchenkov","doi":"10.1145/3446132.3446420","DOIUrl":"https://doi.org/10.1145/3446132.3446420","url":null,"abstract":"Converting source codes to feature vectors can be useful in programming-related tasks, such as plagiarism detection on ACM contests. We present a brand-new method for feature extraction from C++ files, which includes both features describing syntactic and lexical properties of an AST tree and features characterizing disassembly of source code. We propose a method for solving the plagiarism detection task as a classification problem. We prove the effectiveness of our feature set by testing on a dataset that contains 50 ACM problems and ∼90k solutions for them. Trained xgboost model gets a relative binary f1-score=0.745 on the test set.","PeriodicalId":125388,"journal":{"name":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130198416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
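The syntactic half of such a feature set can be illustrated with a node-type histogram over the AST: identifier renaming, a classic plagiarism disguise, leaves the histogram unchanged. The paper works on C++ ASTs and disassembly; Python's stdlib `ast` module is used below only to make the idea concrete, and the cosine comparison is a simplification of feeding the features to a classifier.

```python
import ast
from collections import Counter

def ast_features(source):
    """Histogram of AST node types: a simple syntactic fingerprint."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def similarity(a, b):
    """Cosine similarity between two node-type histograms."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb)

original = "def f(xs):\n    total = 0\n    for x in xs:\n        total += x\n    return total\n"
renamed = "def g(items):\n    s = 0\n    for i in items:\n        s += i\n    return s\n"
unrelated = "print('hello world')\n"

sim_plag = similarity(ast_features(original), ast_features(renamed))   # identical structure
sim_diff = similarity(ast_features(original), ast_features(unrelated))
```

In the paper's setting, such per-file feature vectors (plus disassembly features) become the input rows for the XGBoost classifier.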