Comparing Heuristic Rules and Masked Language Models for Entity Alignment in the Literature Domain
Authors: Dominique Piché, L. Font, A. Zouaq, M. Gagnon
Journal: ACM Journal on Computing and Cultural Heritage
Publication date: 2023-07-15
DOI: https://doi.org/10.1145/3606699
The cultural world offers a staggering amount of rich and varied metadata on cultural heritage, accumulated by governmental, academic and commercial players. However, the variety of institutions involved means that the data is stored in many complex and often incompatible models and standards, which limits its availability to, and explorability by, the wider public. The adoption of Linked Open Data technologies allows strong interlinking of these various databases as well as external connections with existing knowledge bases. However, because these databases often contain references to the same entities, entity alignment becomes the central challenge, especially when unique global identifiers are absent or scarce. To tackle this issue, we explored two approaches, one based on a set of heuristic rules and one based on masked language models (MLMs). We compare these two approaches, as well as different variations of MLMs, including some models trained on a different language, and various levels of data cleaning and labeling. Our results show that heuristics are a solid approach, but also that MLM-based entity alignment obtains better performance, is robust to the data format, and does not require any form of data preprocessing, unlike the heuristic approach in our experiments.
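To make the idea of heuristic entity alignment concrete, here is a minimal Python sketch of what one such rule could look like. It is purely illustrative and is not the rule set used in the paper: the record fields `name` and `birth_year`, the helper names, and the matching logic are all assumptions. It also shows why a rule-based approach tends to depend on preprocessing, since the names must be normalized before they can be compared.

```python
import unicodedata
from typing import Optional


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and sort the name tokens.

    Hypothetical preprocessing step: it makes "Hugo, Victor" and
    "Victor Hugo" compare equal.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(sorted(cleaned.split()))


def same_entity(a: dict, b: dict) -> bool:
    """Illustrative rule: names must match after normalization, and birth
    years must agree whenever both records provide one."""
    if normalize_name(a["name"]) != normalize_name(b["name"]):
        return False
    year_a: Optional[int] = a.get("birth_year")
    year_b: Optional[int] = b.get("birth_year")
    if year_a is not None and year_b is not None:
        return year_a == year_b
    # With no conflicting evidence, the name match alone decides.
    return True


if __name__ == "__main__":
    record_a = {"name": "Hugo, Victor", "birth_year": 1802}
    record_b = {"name": "Victor Hugo"}
    print(same_entity(record_a, record_b))
```

A rule like this is transparent and cheap to run, but every new data format needs its own normalization. That matches the contrast the abstract draws with the MLM-based approach, which is reported to work without preprocessing.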
About the journal:
ACM Journal on Computing and Cultural Heritage (JOCCH) publishes papers of significant and lasting value in all areas relating to the use of information and communication technologies (ICT) in support of Cultural Heritage. The journal encourages the submission of manuscripts that demonstrate innovative use of technology for the discovery, analysis, interpretation and presentation of cultural material, as well as manuscripts that illustrate applications in the Cultural Heritage sector that challenge the computational technologies and suggest new research opportunities in computer science.