{"title":"基于机器学习和深度学习技术的中文新闻分类实证研究","authors":"Chuen-Min Huang, Yi-Jun Jiang","doi":"10.1109/ICMLC48188.2019.8949309","DOIUrl":null,"url":null,"abstract":"This study compares Chinese news classification results of machine learning (ML) and deep learning (DL). In processing ML, we chose Support Vector Machine (SVM) and Naive Bayes (NB) to form three models: Word2Vec-SVM, TFIDF-SVM, and TFIDF-NB. Since NB assumes that the words are independent, this is different from the concept of related word distribution in Word2Vec, so the combination with NB is excluded. In processing DL, we adopted Bidirectional Long Short-Term Memory (Bi-LSTM), Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN), and used Word2Vec for word embedding. Experimental results showed that with proper word preprocessing, the difference of classification accuracy of ML and DL models is actually very small. Although the results show that Bi-LSTM performs the most accurate and has the lowest Loss compared to other DL techniques, its implementation process is the most time consuming. This study affirms the excellent results of CNN, while its Loss is the highest of the DL models. We also found that Word2Vec-SVM was superior to TFIDF-SVM in terms of efficiency, but its accuracy is not as good as expected. To summarize the classification accuracy in Bi-LSTM, LSTM, CNN, Word2vec-SVM, TFIDF-SVM, and NB are 89.3%, 88%, and 87.54%, 85.32%, 87.35%, 86.56%, respectively.","PeriodicalId":221349,"journal":{"name":"2019 International Conference on Machine Learning and Cybernetics (ICMLC)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"An Empirical Study on the Classification of Chinese News Articles by Machine Learning and Deep Learning Techniques\",\"authors\":\"Chuen-Min Huang, Yi-Jun Jiang\",\"doi\":\"10.1109/ICMLC48188.2019.8949309\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This study compares Chinese news classification results of machine learning (ML) and deep learning (DL). In processing ML, we chose Support Vector Machine (SVM) and Naive Bayes (NB) to form three models: Word2Vec-SVM, TFIDF-SVM, and TFIDF-NB. Since NB assumes that the words are independent, this is different from the concept of related word distribution in Word2Vec, so the combination with NB is excluded. In processing DL, we adopted Bidirectional Long Short-Term Memory (Bi-LSTM), Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN), and used Word2Vec for word embedding. Experimental results showed that with proper word preprocessing, the difference of classification accuracy of ML and DL models is actually very small. Although the results show that Bi-LSTM performs the most accurate and has the lowest Loss compared to other DL techniques, its implementation process is the most time consuming. This study affirms the excellent results of CNN, while its Loss is the highest of the DL models. We also found that Word2Vec-SVM was superior to TFIDF-SVM in terms of efficiency, but its accuracy is not as good as expected. 
To summarize the classification accuracy in Bi-LSTM, LSTM, CNN, Word2vec-SVM, TFIDF-SVM, and NB are 89.3%, 88%, and 87.54%, 85.32%, 87.35%, 86.56%, respectively.\",\"PeriodicalId\":221349,\"journal\":{\"name\":\"2019 International Conference on Machine Learning and Cybernetics (ICMLC)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 International Conference on Machine Learning and Cybernetics (ICMLC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICMLC48188.2019.8949309\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 International Conference on Machine Learning and Cybernetics (ICMLC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLC48188.2019.8949309","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
An Empirical Study on the Classification of Chinese News Articles by Machine Learning and Deep Learning Techniques
This study compares Chinese news classification results obtained with machine learning (ML) and deep learning (DL). For the ML track, we chose Support Vector Machine (SVM) and Naive Bayes (NB) classifiers and formed three models: Word2Vec-SVM, TFIDF-SVM, and TFIDF-NB. Because NB assumes that words are independent, which conflicts with the distributional notion of related words underlying Word2Vec, the Word2Vec-NB combination is excluded. For the DL track, we adopted Bidirectional Long Short-Term Memory (Bi-LSTM), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN) models, all using Word2Vec for word embedding. Experimental results showed that, with proper word preprocessing, the difference in classification accuracy between the ML and DL models is actually very small. Although Bi-LSTM is the most accurate and has the lowest loss among the DL techniques, it is also the most time-consuming to train. The study affirms the strong results of CNN, even though its loss is the highest among the DL models. We also found that Word2Vec-SVM was more efficient than TFIDF-SVM, but its accuracy was not as good as expected. In summary, the classification accuracies of Bi-LSTM, LSTM, CNN, Word2Vec-SVM, TFIDF-SVM, and TFIDF-NB are 89.3%, 88%, 87.54%, 85.32%, 87.35%, and 86.56%, respectively.
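
For concreteness, a minimal sketch of the TF-IDF based ML baselines named in the abstract (TFIDF-SVM and TFIDF-NB) could look like the following. The jieba segmenter, the scikit-learn pipeline, the toy documents, and the default hyperparameters are illustrative assumptions, not the authors' actual configuration.

# Minimal sketch: TFIDF-SVM and TFIDF-NB baselines (assumed tooling: jieba + scikit-learn).
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def tokenize_zh(text):
    # Segment Chinese text into words; TF-IDF then treats each word as a term.
    return [w for w in jieba.lcut(text) if w.strip()]

# Toy news snippets and labels; the real study used a Chinese news corpus.
docs = ["央行宣布调整利率政策", "球队在昨晚的比赛中获胜"]
labels = ["finance", "sports"]

tfidf_svm = make_pipeline(TfidfVectorizer(tokenizer=tokenize_zh), LinearSVC())
tfidf_nb = make_pipeline(TfidfVectorizer(tokenizer=tokenize_zh), MultinomialNB())

for name, model in [("TFIDF-SVM", tfidf_svm), ("TFIDF-NB", tfidf_nb)]:
    model.fit(docs, labels)
    print(name, model.predict(["昨晚的比赛结果"]))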
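
On the DL side, the abstract pairs Word2Vec embeddings with Bi-LSTM, LSTM, or CNN classifiers. A minimal sketch with gensim and Keras follows; the embedding dimension, layer sizes, and number of news categories are illustrative assumptions, and the abstract does not state whether the original study fine-tuned the embeddings (here they are frozen to keep the comparison focused on the classifier architecture).

# Minimal sketch: Word2Vec embeddings feeding Bi-LSTM, LSTM, or CNN classifiers
# (assumed tooling: gensim + tf.keras; all sizes are illustrative).
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras import initializers, layers, models

# Toy tokenized corpus; in practice this would be the segmented news articles.
sentences = [["央行", "调整", "利率"], ["球队", "比赛", "获胜"]]
w2v = Word2Vec(sentences, vector_size=100, min_count=1)

# Build an embedding matrix from the trained Word2Vec vectors (index 0 = padding).
vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}
emb = np.zeros((len(vocab) + 1, w2v.vector_size))
for word, idx in vocab.items():
    emb[idx] = w2v.wv[word]

def build_model(kind, num_classes=10):
    # num_classes=10 is an assumed number of news categories.
    m = models.Sequential()
    m.add(layers.Embedding(emb.shape[0], emb.shape[1],
                           embeddings_initializer=initializers.Constant(emb),
                           trainable=False))
    if kind == "bilstm":
        m.add(layers.Bidirectional(layers.LSTM(128)))
    elif kind == "lstm":
        m.add(layers.LSTM(128))
    else:  # "cnn"
        m.add(layers.Conv1D(128, 5, activation="relu"))
        m.add(layers.GlobalMaxPooling1D())
    m.add(layers.Dense(num_classes, activation="softmax"))
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
    return m

bilstm = build_model("bilstm")   # likewise build_model("lstm") or build_model("cnn")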