A Use Case of Patent Classification Using Deep Learning with Transfer Learning

R. Henriques, Adria Ferreira, M. Castelli
Journal of Data and Information Science (Warsaw, Poland), 7(1), pp. 49-70. Published 2022-08-01. DOI: 10.2478/jdis-2022-0015 (https://doi.org/10.2478/jdis-2022-0015)
Citations: 3

Abstract

Purpose: Patent classification is one of the areas of Intellectual Property Analytics (IPA), and a growing use case as the number of patent applications increases worldwide. We propose using machine learning algorithms to classify Portuguese patents and evaluate the performance of transfer learning methodologies on this task.

Design/methodology/approach: We applied three approaches in this paper. First, we used a dataset made available by INPI to explore traditional machine learning algorithms and ensemble methods. After preprocessing the data with TF-IDF, FastText, and Doc2Vec, the models were evaluated by 5-fold cross-validation. Second, we used two neural network architectures, a Convolutional Neural Network (CNN) and a bidirectional Long Short-Term Memory network (BiLSTM). Third, we used pre-trained BERT, DistilBERT, and ULMFiT models.

Findings: BERTTimbau, a BERT-architecture model pre-trained on a large Portuguese corpus, gave the best results for the task, although its performance was only 4% better than that of a LinearSVC model using TF-IDF feature engineering.

Research limitations: The dataset was highly imbalanced, as is usual for patent applications, so the classes with the fewest samples were expected to perform worst. This occurred in some cases, especially in classes with fewer than 60 training samples.

Practical implications: Patent classification is challenging because of the hierarchical classification system, the context overlap, and the underrepresentation of some classes. However, the final model showed acceptable performance given the size of the dataset and the complexity of the task. The model can support decision-making and reduce processing time by proposing a category at the second level of the ICP, which is one of the critical phases of the patent grant process.

Originality/value: To our knowledge, the proposed models had never been implemented for Portuguese patent classification.
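The abstract mentions TF-IDF features evaluated with 5-fold cross-validation on a highly imbalanced dataset. The following is a minimal, dependency-free sketch of those two ideas only; it is not the authors' pipeline. The function names `tfidf` and `stratified_folds`, the smoothed IDF formula, the round-robin fold assignment, and the toy texts and class labels are all assumptions made for illustration. In practice a library such as scikit-learn would normally supply the vectorizer, the classifier, and the fold splitter.

```python
import math
from collections import Counter, defaultdict


def tfidf(docs):
    """Return one {term: weight} dict per document (term frequency x smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def stratified_folds(labels, k=5):
    """Split sample indices into k folds, spreading each class round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Toy, made-up texts (not from the INPI dataset).
texts = [
    "antenna for wireless signal",
    "wireless signal receiver",
    "surgical implant device",
]
weights = tfidf(texts)
# Terms of the third text, most distinctive first.
print(sorted(weights[2], key=weights[2].get, reverse=True))

# Hypothetical imbalanced labels: 10 samples of one class, 3 of another.
labels = ["A61"] * 10 + ["H04"] * 3
for fold_id, fold in enumerate(stratified_folds(labels, k=5)):
    print(fold_id, dict(Counter(labels[i] for i in fold)))
```

With only three samples of the rarer class and five folds, two folds contain no sample of that class at all, which gives a small-scale picture of why sparsely populated classes are hard to evaluate and tend to score worst.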