Deep Linear Discriminative Analysis (DeepLDA) is an effective feature learning method that combines LDA with a deep neural network. The core of DeepLDA is an LDA-based loss function placed on top of a deep neural network built from fully-connected layers. Fully-connected layers generally consume a large amount of computing resources. Moreover, when fully-connected layers are used, the capacity of the network may be too large to fit the training data properly. The performance of DeepLDA may therefore be improved by increasing the sparsity of the network. In this paper, a sparse training strategy is used to train DeepLDA. First, the dense layers in DeepLDA are replaced by a sparse topology based on an Erdős-Rényi random graph. Then, the sparse evolutionary training (SET) strategy is employed to train DeepLDA. Preliminary experiments show that DeepLDA trained with the SET strategy outperforms DeepLDA trained with fully-connected layers on the MNIST classification task.
A Sparse Deep Linear Discriminative Analysis using Sparse Evolutionary Training. Xuefeng Bai, Lijun Yan. Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, 2020-12-24. DOI: https://doi.org/10.1145/3446132.3446167
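The two steps named in the abstract above, an Erdős-Rényi sparse initialization followed by SET-style prune-and-regrow, can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names and the parameters `epsilon` (sparsity control) and `zeta` (fraction of connections rewired) are assumptions, and pruning by smallest magnitude is a simplification of the usual SET rule.

```python
import numpy as np


def erdos_renyi_mask(n_in, n_out, epsilon, rng):
    """Boolean connectivity mask for one layer; each link exists with probability p."""
    # Assumed SET-style density: p = epsilon * (n_in + n_out) / (n_in * n_out)
    p = min(1.0, epsilon * (n_in + n_out) / (n_in * n_out))
    return rng.random((n_in, n_out)) < p


def set_prune_and_regrow(weights, mask, zeta, rng):
    """Remove the weakest fraction `zeta` of connections and regrow as many at random."""
    flat_w = weights.copy().ravel()
    flat_m = mask.copy().ravel()
    active = np.flatnonzero(flat_m)
    n_prune = int(zeta * active.size)
    if n_prune == 0:
        return weights.copy(), mask.copy()
    # Simplification (assumption): prune the connections with the smallest magnitude.
    weakest = active[np.argsort(np.abs(flat_w[active]))[:n_prune]]
    flat_m[weakest] = False
    flat_w[weakest] = 0.0
    # Regrow the same number of connections at random empty positions.
    empty = np.flatnonzero(~flat_m)
    grown = rng.choice(empty, size=n_prune, replace=False)
    flat_m[grown] = True
    flat_w[grown] = rng.normal(0.0, 0.01, size=n_prune)
    return flat_w.reshape(weights.shape), flat_m.reshape(mask.shape)
```

In a training loop, a step like `set_prune_and_regrow` would typically run once per epoch, with the mask applied to the weights after every gradient update so that removed connections stay at zero.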
This paper uses a genetic algorithm and a dynamic programming algorithm to solve the 0-1 knapsack problem and analyzes the principles and implementation of both methods. For each method, the initial conditions are varied, and the running time, the number of iterations, and the accuracy of the results under the different conditions are compared. The reasons for the differences are then studied in order to characterize the two algorithms.
Comparison of genetic algorithm and dynamic programming solving knapsack problem. Yan Wang, M. Wang, Jia Li, Xiang Xu. Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, 2020-12-24. DOI: https://doi.org/10.1145/3446132.3446142
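For orientation, the dynamic programming side of such a comparison is usually the textbook recurrence over remaining capacity. The sketch below is that standard formulation, not code from the paper; the function name and argument layout are illustrative, and integer weights are assumed.

```python
def knapsack_01(values, weights, capacity):
    """Maximum total value of items that fit in `capacity`, each item used at most once."""
    # best[c] holds the best value achievable with total weight <= c.
    best = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        # Iterate capacities downwards so each item is counted at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]
```

This runs in O(n * capacity) time and always returns the exact optimum, which is what makes it a natural accuracy baseline for a heuristic such as a genetic algorithm.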
Automatic keyword extraction refers to extracting words or phrases from a single text or a text collection. Supervised methods outperform unsupervised methods, but they require a large labeled corpus for training. To address this problem, extra knowledge is obtained through labels generated by other tools. Moreover, preprocessing Chinese text is more challenging than preprocessing English text because of the fragments produced by word segmentation. Named entity recognition is therefore introduced into the preprocessing to improve accuracy. In addition, a text consists of separate parts, and each part conveys information to readers at a different level. We thus present a priority-based text weighting method that takes into account the importance of the different parts of a text. In this paper, we integrate the three ideas above and propose a novel hybrid method for Chinese keyword extraction (HSCKE). To evaluate the proposed approach, we compare HSCKE with the four most commonly used methods on two typical Chinese keyword extraction datasets. The experimental results show that the proposed approach achieves the best performance in terms of precision, recall and F1 score.
HSCKE: A Hybrid Supervised Method for Chinese Keywords Extraction. Shuyu Kong, Ping Zhu, Qian Yang, Zhihua Wei. Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, 2020-12-24. DOI: https://doi.org/10.1145/3446132.3446408
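One simple way to picture priority-based text weighting is to let a term's score depend on the part of the text it occurs in. The sketch below only illustrates that idea: the part names, the weights in `PART_WEIGHTS`, and the additive scoring are assumptions, not the scheme used by HSCKE.

```python
from collections import Counter

# Hypothetical priorities: terms in higher-priority parts count more.
PART_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}


def weighted_term_scores(parts, part_weights=PART_WEIGHTS):
    """Sum, for every term, the weight of each part in which it occurs."""
    scores = Counter()
    for part_name, tokens in parts.items():
        weight = part_weights.get(part_name, 1.0)  # unknown parts get a neutral weight
        for token in tokens:
            scores[token] += weight
    return scores
```

With these assumed weights, a term that appears once in the title and once in the body outranks a term that appears twice in the body only.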
Deep learning is widely used in biometrics, but a large amount of labeled image data is required to train a complex model that performs well. Finger vein recognition has significant advantages over common biometric methods in terms of security and privacy. However, very few finger vein datasets are available. To solve this problem, this paper proposes a GAN-based method for generating finger vein datasets, which is the first attempt to generate a finger vein dataset with a GAN. The paper generates a total of 53,630 finger vein images of 5,363 different subjects and validates the synthetic dataset, which provides a basis for applying complex deep neural networks to finger vein recognition.
A GAN-based Method for Generating Finger Vein Dataset. Hanwen Yang, P. Fang, Zhiang Hao. Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, 2020-12-24. DOI: https://doi.org/10.1145/3446132.3446150
Garbage recycling is becoming an urgent need, as the rapid development of human society produces a colossal amount of waste every year. However, current machine learning models for intelligent garbage detection and classification are highly constrained by their limited processing speeds and large model sizes, which make them difficult to deploy on portable, real-time, and energy-efficient edge-computing devices. Therefore, in this paper, we introduce a novel YOLO-based neural network model with a Variational Autoencoder (VAE) to increase the accuracy of automatic garbage recycling, speed up computation, and reduce the model size so that it is feasible in real-world garbage recycling scenarios. The model consists of a convolutional feature extractor, a convolutional predictor, and a decoder. After training, the model achieves a correct rate of 69.70% with a total of 32.1 million parameters and a processing speed of 60 frames per second (FPS), surpassing the performance of existing models such as YOLO v1 and Fast R-CNN.
A YOLO-based Neural Network with VAE for Intelligent Garbage Detection and Classification. Anbang Ye, Bo Pang, Yucheng Jin, Jiahuan Cui. Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, 2020-12-24. DOI: https://doi.org/10.1145/3446132.3446400
Hierarchical clustering algorithms based on node similarity have been widely used in community detection, but they are not suitable for signed networks. Typical community detection algorithms for signed networks suffer from a low community division rate from different nodes. Based on node similarity, this paper proposes the CDNS algorithm (Community Detection Algorithm based on Node Similarity in Signed Networks). First, the algorithm introduces a node influence measure suitable for signed networks as the basis for selecting the initial node of a community. Second, it computes node similarity based on eigenvector centrality and selects, from the neighbour nodes, the node most similar to the initial node to form the initial community. Finally, according to the community contribution of the neighbour nodes, the algorithm determines whether each neighbour node joins the community and in which order. Experiments on real and simulated signed networks show that the CDNS algorithm has good accuracy and efficiency.
Community Detection Algorithm based on Node Similarity in Signed Networks. Zhi Bie, Lufeng Qian, J. Ren. Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, 2020-12-24. DOI: https://doi.org/10.1145/3446132.3446184
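The abstract above does not say how eigenvector centrality is adapted to signed edges, so the sketch below shows only the standard, unsigned definition computed by power iteration. It assumes a symmetric adjacency matrix with non-negative weights; the function name and defaults are illustrative and not part of CDNS.

```python
import numpy as np


def eigenvector_centrality(adjacency, max_iter=200, tol=1e-10):
    """Leading eigenvector of a non-negative symmetric adjacency matrix."""
    a = np.asarray(adjacency, dtype=float)
    n = a.shape[0]
    # Adding the identity keeps the eigenvectors but avoids oscillation on bipartite graphs.
    shifted = a + np.eye(n)
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        nxt = shifted @ x
        nxt /= np.linalg.norm(nxt)
        if np.linalg.norm(nxt - x) < tol:
            return nxt
        x = nxt
    return x
```

A node scores highly when it is connected to other high-scoring nodes, which is why the measure is a plausible starting point for a similarity between a seed node and its neighbours.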
Xi Chen, Chongwu Dong, Jinghui Qin, Long Yin, Wushao Wen
Text classification is one of the most fundamental and important tasks in natural language processing; it aims to identify the most relevant label for a given piece of text. Although deep learning-based text classification methods have achieved promising results, most research focuses mainly on the internal context information of the document and ignores available global information such as document hierarchy and label semantics. To address this problem, we propose a novel Label-Attentive Hierarchical Network (LAHN) for document classification. In particular, we integrate label information into the hierarchical structure of the document by computing word-label attention at the word level and sentence-label attention at the sentence level. We give full consideration to the global information while encoding the whole document, which makes the final document representation vector more discriminative for classification. Extensive experiments on several benchmark datasets show that the proposed LAHN surpasses several state-of-the-art methods.
Label-Attentive Hierarchical Network for Document Classification. Xi Chen, Chongwu Dong, Jinghui Qin, Long Yin, Wushao Wen. Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, 2020-12-24. DOI: https://doi.org/10.1145/3446132.3446163
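A generic form of word-label attention scores each word vector against every label embedding and pools the words by their relevance to the labels. The sketch below follows that generic pattern under stated assumptions (dot-product scores, max over labels, softmax over words); it is not the LAHN architecture, whose details the abstract does not give.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - x.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def label_attentive_pool(word_vectors, label_embeddings):
    """Pool word vectors (n, d) using their affinity to label embeddings (k, d)."""
    scores = word_vectors @ label_embeddings.T   # (n, k) word-label affinities
    relevance = scores.max(axis=1)               # strongest label match per word
    weights = softmax(relevance)                 # attention weights over words
    return weights @ word_vectors, weights       # (d,) pooled vector, (n,) weights
```

The same function could be applied a second time to sentence vectors to obtain a document vector, which mirrors the two-level (word, then sentence) structure described in the abstract.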
Uncertainty is an inherent feature of Knowledge Graphs (KGs) and is often modelled as confidence scores of relation facts. Although Knowledge Graph Embedding (KGE) has been very successful recently, predicting the confidence of unseen facts in a KG in a continuous vector space remains a major challenge. There are several reasons for this. First, current KGE methods mostly deal with deterministic knowledge, in which the confidence of unseen facts is treated as zero and that of all other facts as one. Second, uncertainty features are not well preserved in the embedding space. Third, approximate reasoning in embedding spaces is often unexplainable and unintuitive. Furthermore, the time and space costs of obtaining embedding spaces that preserve uncertainty are usually very high. To address these issues for Uncertain Knowledge Graphs (UKGs), we propose a fast and effective embedding method, UKGsE, in which approximate reasoning and calculation can be performed quickly once an Uncertain Knowledge Graph Embedding (UKGE) space has been generated at high speed and with reasonable accuracy. The idea is that treating relation facts as short sentences and pre-handling them benefits the learning of their confidence scores. The experiments show that the method is suitable for the downstream task of predicting the confidence of relation facts, whether or not they are seen in the UKG. It achieves the best tradeoff between efficiency and accuracy in predicting the uncertain confidence of knowledge. Further, we found that the model outperforms state-of-the-art uncertain link prediction baselines on the CN15k dataset.
Fast Confidence Prediction of Uncertainty based on Knowledge Graph Embedding. Shihan Yang, Weiya Zhang, R. Tang. Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, 2020-12-24. DOI: https://doi.org/10.1145/3446132.3446186
Traditional video-based drill pipe counting methods have low accuracy and are vulnerable to interference when locating and tracking targets. To address this problem, a drill pipe counting method based on scale space and a Siamese network is proposed. The shape features of the drilling machine video image are computed with an improved scale space algorithm, and the initial position of the drilling machine chuck is determined by feature matching. The chuck is then tracked in real time with an improved Siamese network algorithm, and its movement trajectory is recorded. Finally, the number of drill pipes is calculated using counting rules after locally weighted regression and hierarchical classification of the chuck movement trajectory. The test results show that the improved method can stably track the target under the interference of bright light and count drill pipes accurately.
Drill Pipe Counting Method Based on Scale Space and Siamese Network. Lihong Dong, Xinyi Wu, Jiehui Zhang. Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, 2020-12-24. DOI: https://doi.org/10.1145/3446132.3446179
Converting source code to feature vectors can be useful in programming-related tasks, such as plagiarism detection in ACM contests. We present a new method for feature extraction from C++ files, which includes both features describing syntactic and lexical properties of the AST and features characterizing the disassembly of the source code. We propose a method for solving the plagiarism detection task as a classification problem. We demonstrate the effectiveness of our feature set by testing on a dataset that contains 50 ACM problems and ∼90k solutions to them. A trained xgboost model achieves a relative binary F1-score of 0.745 on the test set.
A Machine Learning Based Plagiarism Detection in Source Code. N. Viuginov, P. Grachev, A. Filchenkov. Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, 2020-12-24. DOI: https://doi.org/10.1145/3446132.3446420
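When plagiarism detection is framed as binary classification, the plain binary F1 score is the harmonic mean of precision and recall on the positive ("plagiarised") class. The sketch below computes that standard quantity; how the "relative" variant reported in the abstract is defined is not stated there, so it is not reproduced here, and the function name is illustrative.

```python
def binary_f1(y_true, y_pred):
    """F1 score of the positive class (label 1) for two equal-length label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged pairs
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed cases
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because true negatives do not enter the formula, F1 stays informative when plagiarised pairs are rare compared with independent solutions.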