Chinese public procurement document harvesting pipeline
Danrun Cao, Oussama Ahmia, Nicolas Béchet, P. Marteau
Proceedings of the 22nd ACM Symposium on Document Engineering, published 2022-09-20
DOI: 10.1145/3558100.3563848 (https://doi.org/10.1145/3558100.3563848)
Citation count: 1
Abstract
We present a processing pipeline for harvesting Chinese public procurement documents, with the aim of producing strategic data with greater added value. It consists of three micro-modules: data collection, information extraction, and database indexing. The information extraction module is implemented as a hybrid system that combines rule-based and machine learning approaches. The rule-based method is used to extract information that presents recurring morphological features, such as dates, amounts, and contract awardee information. The machine learning method is used for trade detection in the titles of procurement documents.
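To illustrate the rule-based side of the extraction step, the following is a minimal sketch of how dates, amounts, and awardee names with recurring surface patterns might be pulled out of a Chinese procurement notice using regular expressions. The abstract does not give the authors' actual rules, so the patterns, the function names (`extract_dates`, `extract_amounts`, `extract_awardee`), the field labels, and the sample sentence are all assumptions made for illustration only.

```python
import re

# Hypothetical patterns: the paper's real rules are not given in the abstract.
# Dates written as e.g. 2022年9月20日 (year, month, day).
DATE_RE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")
# Amounts such as 125.5万元 (units of 10,000 yuan) or 3000元 (yuan).
AMOUNT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(万元|元)")
# Awardee introduced by an assumed label such as 中标单位 / 中标人 / 中标供应商.
AWARDEE_RE = re.compile(r"中标(?:单位|人|供应商)[:：]\s*([^\s，。；]+)")


def extract_dates(text):
    """Return every matched date as an ISO 8601 string (YYYY-MM-DD)."""
    return [
        f"{int(y):04d}-{int(m):02d}-{int(d):02d}"
        for y, m, d in DATE_RE.findall(text)
    ]


def extract_amounts(text):
    """Return every matched amount converted to yuan as a float."""
    amounts = []
    for value, unit in AMOUNT_RE.findall(text):
        factor = 10_000 if unit == "万元" else 1
        amounts.append(float(value) * factor)
    return amounts


def extract_awardee(text):
    """Return the first awardee name found, or None if no label matches."""
    match = AWARDEE_RE.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Invented sample sentence, not taken from the paper's corpus.
    sample = "中标单位：北京示例科技有限公司，中标金额：125.5万元，公告日期：2022年9月20日。"
    print(extract_dates(sample))
    print(extract_amounts(sample))
    print(extract_awardee(sample))
```

This sketch deliberately ignores cases a production rule set would need to cover, such as amounts with thousands separators, amounts written in Chinese numerals, or notices listing several awardees. The trade detection step described in the abstract relies on machine learning and is not covered here.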