A Use Case of Patent Classification Using Deep Learning with Transfer Learning

Journal of data and information science (Warsaw, Poland) | Pub Date: 2022-08-01 | DOI: 10.2478/jdis-2022-0015

R. Henriques, Adria Ferreira, M. Castelli

Volume: 7 1 | Pages: 49 - 70

Citations: 3


A Use Case of Patent Classification Using Deep Learning with Transfer Learning

Abstract

Purpose: Patent classification is one of the areas of Intellectual Property Analytics (IPA), and a growing use case as the number of patent applications increases worldwide. We propose using machine learning algorithms to classify Portuguese patents and evaluate how well transfer learning methodologies solve this task.

Design/methodology/approach: We applied three approaches. First, we used a dataset made available by INPI to explore traditional machine learning algorithms and ensemble methods. After preprocessing the data with TF-IDF, FastText, and Doc2Vec, we evaluated the models with 5-fold cross-validation. Second, we used two neural network architectures: a Convolutional Neural Network (CNN) and a bidirectional Long Short-Term Memory network (BiLSTM). Third, we used pre-trained BERT, DistilBERT, and ULMFiT models.

Findings: BERTTimbau, a BERT-architecture model pre-trained on a large Portuguese corpus, gave the best results for the task, although its performance was only 4% better than that of a LinearSVC model using TF-IDF feature engineering.

Research limitations: The dataset was highly imbalanced, as is usual for patent applications, so the classes with the fewest samples were expected to perform worst. This happened in some cases, especially in classes with fewer than 60 training samples.

Practical implications: Patent classification is challenging because of the hierarchical classification system, the context overlap, and the underrepresentation of some classes. However, the final model performed acceptably given the size of the dataset and the complexity of the task. The model can support decision-making and save time by proposing a category at the second level of ICP, a step that is one of the critical phases of the patent grant process.

Originality/value: To our knowledge, the proposed models have never before been applied to Portuguese patent classification.
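To make the first approach more concrete, the following is a minimal sketch of TF-IDF weighting and a 5-fold split, written in plain Python. It is not the authors' implementation: the abstract does not state which TF-IDF variant or fold assignment was used, so the textbook formula tf * log(N / df), the contiguous folds, the function names `tfidf` and `k_fold_indices`, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses the textbook weighting tf * log(N / df). The variant used in the
    paper is not stated in the abstract, so this choice is an assumption.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# Hypothetical toy corpus (not taken from the INPI dataset).
docs = [
    "motor eletrico com rotor",
    "composicao farmaceutica para tratamento",
    "motor de combustao interna",
    "metodo de tratamento de agua",
    "dispositivo eletrico de medicao",
]
vectors = tfidf(docs)
folds = list(k_fold_indices(len(docs), k=5))
```

In a real experiment each fold's training indices would be used to fit a classifier (the abstract names LinearSVC) and the held-out indices to score it. With a dataset as imbalanced as the one described, a stratified split that preserves class proportions in every fold would likely be preferable to the contiguous folds shown here.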
