AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model
Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter
arXiv - CS - Databases, 2024-09-06, https://doi.org/arxiv-2409.04073
Abstract
Entity matching (EM) is the problem of determining whether two records refer
to the same real-world entity, which is crucial in data integration, e.g., for
product catalogs or address databases. A major drawback of many EM approaches
is their dependence on labelled examples. We thus focus on the challenging
setting of zero-shot entity matching where no labelled examples are available
for an unseen target dataset. Recently, large language models (LLMs) have shown
promising results for zero-shot EM, but their low throughput and high
deployment cost limit their applicability and scalability. We revisit the
zero-shot EM problem with AnyMatch, a small language model
fine-tuned in a transfer learning setup. We propose several novel data
selection techniques to generate fine-tuning data for our model, e.g., by
selecting difficult pairs to match via an AutoML filter, by generating
additional attribute-level examples, and by controlling label imbalance in the
data. We conduct an extensive evaluation of the prediction quality and
deployment cost of our model, comparing it to thirteen baselines on nine benchmark
datasets. We find that AnyMatch provides competitive prediction quality despite
its small parameter size: it achieves the second-highest F1 score overall, and
outperforms several other approaches that employ models with hundreds of
billions of parameters. Furthermore, our approach exhibits major cost benefits:
the average prediction quality of AnyMatch is within 4.4% of the
state-of-the-art method MatchGPT with the proprietary trillion-parameter model
GPT-4, yet AnyMatch requires four orders of magnitude fewer parameters and
incurs an inference cost that is 3,899 times lower (in dollars per 1,000 tokens).
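
To make two of the data selection ideas named above more concrete, the following is a minimal Python sketch of generating attribute-level examples and controlling label imbalance. It is an illustration under stated assumptions, not the AnyMatch implementation: the abstract does not specify the serialization format, how attribute-level examples are derived, or the target label ratio. The helper names (`serialize`, `attribute_level_examples`, `cap_negatives`), the `key: value` format, the choice to derive attribute-level examples only from matching pairs, and the `max_neg_per_pos` parameter are all hypothetical. The AutoML-based selection of difficult pairs is not shown.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, str]
Pair = Tuple[Record, Record, int]      # (left, right, label), label 1 = match
Example = Tuple[str, str, int]         # serialized text pair plus label


def serialize(record: Record) -> str:
    """Flatten a record into a single string (hypothetical format)."""
    return " ".join(f"{key}: {value}" for key, value in record.items())


def attribute_level_examples(pairs: List[Pair]) -> List[Example]:
    """Assumption: each attribute shared by a matching pair yields one
    additional positive example that compares only that attribute."""
    examples: List[Example] = []
    for left, right, label in pairs:
        if label != 1:
            continue
        for key in left:               # keeps the attribute order of `left`
            if key in right:
                examples.append(
                    (f"{key}: {left[key]}", f"{key}: {right[key]}", 1)
                )
    return examples


def cap_negatives(examples: List[Example],
                  max_neg_per_pos: int = 2,
                  seed: int = 0) -> List[Example]:
    """Control label imbalance by downsampling non-matches so that there
    are at most `max_neg_per_pos` negatives per positive (assumed ratio)."""
    positives = [e for e in examples if e[2] == 1]
    negatives = [e for e in examples if e[2] == 0]
    limit = max_neg_per_pos * len(positives)
    if len(negatives) > limit:
        negatives = random.Random(seed).sample(negatives, limit)
    return positives + negatives


# Toy labelled pairs from a hypothetical source dataset (made-up records).
iphone = {"title": "iPhone 13", "brand": "Apple"}
pairs: List[Pair] = [
    (iphone, {"title": "Apple iPhone 13", "brand": "Apple"}, 1),
    (iphone, {"title": "Galaxy S21", "brand": "Samsung"}, 0),
    (iphone, {"title": "Pixel 6", "brand": "Google"}, 0),
    (iphone, {"title": "iPad Air", "brand": "Apple"}, 0),
    (iphone, {"title": "Xperia 5", "brand": "Sony"}, 0),
]

record_level = [(serialize(l), serialize(r), y) for l, r, y in pairs]
training = cap_negatives(record_level + attribute_level_examples(pairs),
                         max_neg_per_pos=1)
```

In this toy run, the single matching pair contributes one record-level and two attribute-level positives, and the four non-matches are downsampled to three, so `training` holds a balanced set of six examples that could be passed to a fine-tuning routine.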