Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter
arXiv - CS - Databases, published 2024-09-06, doi: arxiv-2409.04073
AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model
Entity matching (EM) is the problem of determining whether two records refer
to the same real-world entity, which is crucial in data integration, e.g., for
product catalogs or address databases. A major drawback of many EM approaches
is their dependence on labelled examples. We thus focus on the challenging
setting of zero-shot entity matching where no labelled examples are available
for an unseen target dataset. Recently, large language models (LLMs) have shown
promising results for zero-shot EM, but their low throughput and high
deployment cost limit their applicability and scalability.

We revisit the zero-shot EM problem with AnyMatch, a small language model
fine-tuned in a transfer learning setup. We propose several novel data
selection techniques to generate fine-tuning data for our model, e.g., by
selecting difficult pairs to match via an AutoML filter, by generating
additional attribute-level examples, and by controlling label imbalance in the
data.

We conduct an extensive evaluation of the prediction quality and deployment
cost of our model, comparing it with thirteen baselines on nine benchmark
datasets. We find that AnyMatch provides competitive prediction quality despite
its small parameter count: it achieves the second-highest F1 score overall, and
outperforms several other approaches that employ models with hundreds of
billions of parameters. Furthermore, our approach exhibits major cost benefits:
the average prediction quality of AnyMatch is within 4.4% of the
state-of-the-art method MatchGPT with the proprietary trillion-parameter model
GPT-4, yet AnyMatch requires four orders of magnitude fewer parameters and
incurs a 3,899 times lower inference cost (in dollars per 1,000 tokens).
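
The abstract names the ingredients of the approach (record pairs classified by a language model, control of label imbalance in the fine-tuning data, and F1 as the quality metric) without giving details. The following is a minimal Python sketch of how such pieces could look, not the AnyMatch implementation. The serialization layout, the function names, and the `max_neg_per_pos` parameter are all hypothetical and chosen only for illustration; negative downsampling is just one common way to control label imbalance.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, str]
# (left record, right record, label), where label 1 means "match".
Pair = Tuple[Record, Record, int]


def serialize_pair(left: Record, right: Record) -> str:
    """Flatten a record pair into one string a sequence classifier could read.

    The "COL <name> VAL <value>" layout is an assumption for illustration,
    not the format used in the paper.
    """
    def flat(rec: Record) -> str:
        return " ".join(f"COL {k} VAL {v}" for k, v in rec.items())

    return f"Record A: {flat(left)} | Record B: {flat(right)}"


def balance_labels(pairs: List[Pair], max_neg_per_pos: float = 2.0,
                   seed: int = 0) -> List[Pair]:
    """Downsample non-matches so there are at most
    max_neg_per_pos negatives per positive (hypothetical ratio)."""
    rng = random.Random(seed)
    pos = [p for p in pairs if p[2] == 1]
    neg = [p for p in pairs if p[2] == 0]
    keep = min(len(neg), int(max_neg_per_pos * len(pos)))
    balanced = pos + rng.sample(neg, keep)
    rng.shuffle(balanced)
    return balanced


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for the positive (match) class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    a = {"name": "iPhone 13 128GB", "brand": "Apple"}
    b = {"name": "Apple iPhone 13 (128 GB)", "brand": "Apple"}
    print(serialize_pair(a, b))

    # Toy data: 2 matches and 10 non-matches.
    toy = [(a, b, 1)] * 2 + [(a, {"name": "Galaxy S21", "brand": "Samsung"}, 0)] * 10
    print(len(balance_labels(toy)))

    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

In a zero-shot setting, the balanced and serialized pairs would come only from datasets other than the target one, and F1 would then be measured on the unseen target dataset.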