End-to-End Full-Page Optical Music Recognition for Pianoform Sheet Music
Antonio Ríos-Vila, Jorge Calvo-Zaragoza, David Rizo, Thierry Paquet
International Journal of Computer Vision. Pub Date: 2026-01-09. DOI: 10.1007/s11263-025-02654-6
Optical Music Recognition (OMR) has made significant progress since its inception, with various approaches now capable of accurately transcribing music scores into digital formats. Despite these advancements, most so-called end-to-end OMR approaches still rely on multi-stage processing pipelines for transcribing full-page score images, which entails challenges such as the need for dedicated layout analysis and specific annotated data, thereby limiting their general applicability. In this paper, we present the first truly end-to-end approach for page-level OMR in complex layouts. Our system, which combines convolutional layers with autoregressive Transformers, processes an entire music score page and outputs a complete transcription in a music encoding format. This is made possible by both the architecture and the training procedure, which employs curriculum learning through incremental synthetic data generation. We evaluate the proposed system on pianoform corpora, which are among the most complex sources in the OMR literature. The evaluation is conducted first in a controlled scenario with synthetic data, and subsequently on two real-world corpora of varying conditions. Our approach is compared with leading commercial OMR software. The results demonstrate that our system not only successfully transcribes full-page music scores but also outperforms the commercial tool, both in zero-shot settings and after fine-tuning on the target domain, representing a significant contribution to the field of OMR.
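The training procedure above relies on curriculum learning via incremental synthetic data generation. One way such a schedule could look, purely as an illustrative sketch (the function name, stage granularity, and the notion of "systems per page" as the complexity axis are assumptions, not the authors' code):

```python
def curriculum_stages(total_epochs, max_systems=8):
    """Map each training epoch to a synthetic-page complexity level.

    Starts with pages containing a single music system and linearly
    ramps up to full pages with `max_systems` systems, so the model
    sees progressively harder layouts as training advances.
    """
    per_stage = max(1, total_epochs // max_systems)
    return [(epoch, min(max_systems, 1 + epoch // per_stage))
            for epoch in range(total_epochs)]
```

For example, over 16 epochs with up to 8 systems per page, the schedule spends two epochs at each complexity level before moving on.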
Delving into Pre-training for Domain Transfer: A Broad Study of Pre-training for Domain Generalization and Domain Adaptation
Jungmyung Wi, Youngkyun Jang, Dujin Lee, Myeongseok Nam, Donghyun Kim
Pub Date: 2026-01-09. DOI: 10.1007/s11263-025-02590-5
Are Minimal Radial Distortion Solvers Really Necessary for Relative Pose Estimation?
Viktor Kocur, Charalambos Tzamos, Yaqing Ding, Zuzana Berger Haladova, Torsten Sattler, Zuzana Kukelova
Pub Date: 2026-01-09. DOI: 10.1007/s11263-025-02657-3
Estimating the relative pose between two cameras is a fundamental step in many applications such as Structure-from-Motion. The common approach to relative pose estimation is to apply a minimal solver inside a RANSAC loop. Highly efficient solvers exist for pinhole cameras. Yet, (nearly) all cameras exhibit radial distortion, and not modeling it leads to significantly worse results. However, minimal radial distortion solvers are significantly more complex than pinhole solvers, both in run-time and in implementation effort. This paper compares radial distortion solvers with two simple-to-implement approaches that do not use minimal radial distortion solvers. The first combines an efficient pinhole solver with sampled radial undistortion parameters, where the sampled parameters are used to undistort the points prior to applying the pinhole solver. The second uses a state-of-the-art neural network to estimate the distortion parameters rather than sampling them from a set of potential values. Extensive experiments on multiple datasets and different camera setups show that complex minimal radial distortion solvers are not necessary in practice. We discuss under which conditions a simple sampling of radial undistortion parameters is preferable over calibrating cameras using a learning-based prior. Code and a newly created benchmark for relative pose estimation under radial distortion are available at https://github.com/kocurvik/rdnet.
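The first approach described in the abstract (sampled undistortion parameters followed by a pinhole solver) can be sketched as follows. This is a minimal illustration assuming the common one-parameter division model for undistortion; the function names and the `pinhole_solver` callback are placeholders, and the paper's actual solvers live in the linked repository:

```python
import numpy as np

def undistort_division(points, lam):
    """Undistort 2D points (N, 2) in normalized image coordinates using the
    one-parameter division model: x_u = x_d / (1 + lam * r^2)."""
    r2 = np.sum(points ** 2, axis=1, keepdims=True)
    return points / (1.0 + lam * r2)

def sampled_undistortion_candidates(pts1, pts2, lambdas, pinhole_solver):
    """For each sampled distortion parameter, undistort both views and run a
    standard pinhole minimal solver; the returned (lambda, model) candidates
    would then be scored inside the RANSAC loop."""
    return [(lam, pinhole_solver(undistort_division(pts1, lam),
                                 undistort_division(pts2, lam)))
            for lam in lambdas]
```

The key design point is that distortion handling is decoupled from the minimal solver: any efficient 5-point or fundamental-matrix solver can be plugged in unchanged, and RANSAC's inlier count selects the best sampled parameter.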
Multi-Granularity Prediction with Learnable Fusion for Scene Text Recognition
Cheng Da, Peng Wang, Cong Yao
Pub Date: 2026-01-07. DOI: 10.1007/s11263-025-02653-7
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Bin Liu, Kai Chen
Pub Date: 2026-01-07. DOI: 10.1007/s11263-025-02649-3
A Traditional Approach for Color Constancy and Color Assimilation Illusions with Its Applications to Low-Light Image Enhancement
Oguzhan Ulucan, Diclehan Ulucan, Marc Ebner
Pub Date: 2026-01-06. DOI: 10.1007/s11263-025-02595-0
The human visual system achieves color constancy, allowing consistent color perception under varying environmental contexts, while also being deceived by color illusions, where contextual information affects our perception. Despite the close relationship between color constancy and color illusions, and their potential benefits to the field, the two phenomena are rarely studied together in computer vision. In this study, we present the benefits of considering color illusions in computer vision. In particular, we introduce a learning-free method, multiresolution color constancy, which combines insights from computational neuroscience and computer vision to address both phenomena within a single framework. Our approach performs color constancy in both multi- and single-illuminant scenarios, while also being deceived by assimilation illusions. Additionally, we extend our method to low-light image enhancement, thus demonstrating its usability across different computer vision tasks. Through comprehensive experiments on color constancy, we show the effectiveness of our method in multi-illuminant and single-illuminant scenarios. Furthermore, we compare our method with state-of-the-art learning-based models on low-light image enhancement, where it shows competitive performance. This work presents the first method that integrates color constancy, color illusions, and low-light image enhancement in a single and explainable framework.
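The multiresolution idea can be illustrated with a minimal sketch: estimate the illuminant at several spatial scales and combine the estimates before applying a von Kries-style correction. The choice of gray-world as the per-scale estimator, block-mean downsampling as the pyramid, and simple averaging of estimates are all assumptions for illustration, not the authors' actual algorithm:

```python
import numpy as np

def grayworld_estimate(img):
    """Gray-world illuminant estimate: per-channel mean, unit-normalized."""
    e = img.reshape(-1, 3).mean(axis=0)
    return e / np.linalg.norm(e)

def multiresolution_constancy(img, levels=3):
    """Average gray-world estimates over a coarse-to-fine pyramid built by
    2x2 block-mean downsampling, then apply a von Kries correction."""
    estimates = []
    cur = img.astype(np.float64)
    for _ in range(levels):
        estimates.append(grayworld_estimate(cur))
        h, w = cur.shape[0] // 2 * 2, cur.shape[1] // 2 * 2
        cur = cur[:h, :w].reshape(h // 2, 2, w // 2, 2, 3).mean(axis=(1, 3))
    e = np.mean(estimates, axis=0)
    e /= np.linalg.norm(e)
    # Divide out the estimated illuminant (sqrt(3) scales a neutral
    # illuminant's unit vector back to per-channel gain 1).
    return np.clip(img / (np.sqrt(3) * e), 0.0, 1.0)
```

On a uniformly tinted image, this correction maps all three channels to the same value, i.e., it recovers a neutral gray, which is the behavior a color constancy method should exhibit in the trivial single-illuminant case.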