Title: Semantic Image Segmentation: Two Decades of Research
Authors: G. Csurka, Riccardo Volpi, Boris Chidlovskii
Foundations and Trends in Computer Graphics and Vision, pp. 1–162, 2023-02-13. DOI: https://doi.org/10.1561/0600000095

Abstract: Semantic image segmentation (SiS) plays a fundamental role in a broad variety of computer vision applications, providing key information for the global understanding of an image. This survey summarizes two decades of research in SiS, reviewing solutions from early historical methods through more recent deep learning methods, including the latest trend of using transformers. We complement the review by discussing weakly supervised settings, as well as auxiliary machine learning techniques that can improve semantic segmentation, such as curriculum, incremental, or self-supervised learning. State-of-the-art SiS models rely on a large amount of annotated samples, which are more expensive to obtain than labels for tasks such as image classification. Since unlabeled data is significantly cheaper to obtain, it is not surprising that Unsupervised Domain Adaptation (UDA) has seen broad success within the semantic segmentation community. A second core contribution of this book is therefore to summarize five years of a rapidly growing field, Domain Adaptation for Semantic Image Segmentation (DASiS), which reflects both the importance of semantic segmentation itself and the critical need to adapt segmentation models to new environments. In addition to providing a comprehensive survey of DASiS techniques, we also cover newer trends such as multi-domain learning, domain generalization, domain incremental learning, test-time adaptation, and source-free domain adaptation. Finally, we describe the datasets and benchmarks most widely used in SiS and DASiS, and briefly discuss related tasks such as instance and panoptic image segmentation, as well as applications such as medical image segmentation.
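The SiS and DASiS benchmarks described above are almost universally scored with mean intersection-over-union (mIoU). As a minimal sketch of that metric — the function name and the toy label maps below are illustrative, and NumPy is assumed:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        union = np.logical_or(p, t).sum()
        if union == 0:          # class absent from both maps: skip it
            continue
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy 2x3 prediction and ground truth with classes {0, 1}
pred   = np.array([[0, 0, 1], [1, 1, 1]])
target = np.array([[0, 0, 1], [0, 1, 1]])
# class 0: intersection 2, union 3 -> 2/3; class 1: intersection 3, union 4 -> 3/4
print(mean_iou(pred, target, num_classes=2))  # 0.7083...
```

Averaging per-class IoU rather than per-pixel accuracy is what keeps rare classes from being drowned out by large ones such as road or sky.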
Title: Learning-based Visual Compression
Authors: Ruolei Ji, Lina Karam
Foundations and Trends in Computer Graphics and Vision, pp. 1–112, 2023. DOI: https://doi.org/10.1561/0600000101
Title: Computational Imaging Through Atmospheric Turbulence
Authors: Stanley H. Chan, Nicholas Chimitt
Foundations and Trends in Computer Graphics and Vision, 2023. DOI: https://doi.org/10.1561/0600000103

Abstract: Seeing through a turbulent atmosphere has been one of the biggest challenges for ground-to-ground, long-range incoherent imaging systems. The literature is rich, dating back to Andrey Kolmogorov in the late 1940s and followed by a series of major developments by David Fried, Robert Noll, and others during the 1960s and 1970s. However, even though we have a much better understanding of the atmosphere today, a gap remains between optics theory and image processing algorithms. In particular, training a deep neural network requires an accurate physical forward model that can synthesize training data at a large scale. Traditional wave propagation simulators are not an option here because they are computationally too expensive: a single 256×256 grayscale image can take several minutes to simulate.
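A standard cheaper alternative to full wave propagation is to sample random phase screens directly from the Kolmogorov power spectrum with an FFT. The sketch below follows that common recipe; the normalization convention and all parameter values are illustrative choices, not taken from the monograph:

```python
import numpy as np

def kolmogorov_phase_screen(n=256, pixel_scale=0.01, r0=0.1, seed=0):
    """Sample one random phase screen with Kolmogorov statistics (FFT method).

    n           grid size in pixels
    pixel_scale metres per pixel
    r0          Fried parameter in metres (smaller r0 = stronger turbulence)
    """
    rng = np.random.default_rng(seed)
    df = 1.0 / (n * pixel_scale)                 # frequency grid spacing [1/m]
    f = np.fft.fftfreq(n, d=pixel_scale)
    fx, fy = np.meshgrid(f, f)
    fr = np.hypot(fx, fy)
    fr[0, 0] = np.inf                            # suppress the undefined DC (piston) term
    psd = 0.023 * r0 ** (-5.0 / 3.0) * fr ** (-11.0 / 3.0)   # Kolmogorov spectrum
    # Complex white noise shaped by sqrt(PSD), then an inverse FFT gives a
    # correlated random field with the desired second-order statistics.
    noise = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    screen = np.fft.ifft2(noise * np.sqrt(psd) * df) * n * n
    return np.real(screen)                       # phase in radians

screen = kolmogorov_phase_screen()
print(screen.shape)  # (256, 256)
```

Because the whole screen is produced by a single FFT, thousands of training samples can be generated per minute — which is exactly the data-synthesis bottleneck the abstract identifies for traditional simulators.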
Title: Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
Authors: Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao
Foundations and Trends in Computer Graphics and Vision, pp. 163–352, 2022-10-17. DOI: https://doi.org/10.48550/arXiv.2210.09263

Abstract: This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches into three categories: (i) VLP for image-text tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding; (ii) VLP for core computer vision tasks, such as (open-set) image classification, object detection, and segmentation; and (iii) VLP for video-text tasks, such as video captioning, video-text retrieval, and video question answering. For each category, we present a comprehensive review of state-of-the-art methods and discuss the progress made and the challenges that remain, using specific systems and models as case studies. In addition, for each category, we discuss advanced topics being actively explored in the research community, such as big foundation models, unified modeling, in-context few-shot learning, knowledge, robustness, and computer vision in the wild, to name a few.
Title: Towards Better User Studies in Computer Graphics and Vision
Authors: Z. Bylinskii, L. Herman, Aaron Hertzmann, Stefanie Hutka, Yile Zhang
Foundations and Trends in Computer Graphics and Vision, pp. 201–252, 2022-06-23. DOI: https://doi.org/10.1561/0600000106

Abstract: Online crowdsourcing platforms have made it increasingly easy to evaluate algorithm outputs with survey questions like "which image is better, A or B?", leading to their proliferation in vision and graphics research papers. The results of these studies are often used as quantitative evidence in support of a paper's contributions. On the one hand, we argue that, when conducted hastily as an afterthought, such studies produce uninformative and potentially misleading conclusions. On the other hand, in these same communities, user research is underutilized in driving project direction and forecasting user needs and reception. We call for increased attention to both the design and the reporting of user studies in computer vision and graphics papers, towards (1) improved replicability and (2) improved project direction. Together with this call, we offer an overview of methodologies from user experience research (UXR), human-computer interaction (HCI), and applied perception, to increase exposure to the available methodologies and best practices. We discuss foundational user research methods (e.g., needfinding) that are presently underutilized in computer vision and graphics research but can provide valuable project direction, and we provide further pointers to the literature for readers interested in exploring other UXR methodologies. Finally, we describe broader open issues and recommendations for the research community.
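For the canonical "which image is better, A or B?" question, the minimum defensible analysis is an exact binomial test against the null hypothesis of no preference. A self-contained sketch — the counts are invented for illustration, and a real study would also need to handle multiple comparisons and dependence between raters:

```python
from math import comb

def two_sided_binomial_p(k, n, p=0.5):
    """Exact two-sided binomial test: probability, under the null that each
    rater picks A with probability p, of an outcome at least as unlikely as
    observing k 'A' votes out of n (the 'minlike' two-sided convention)."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    cutoff = pmf[k]
    return sum(q for q in pmf if q <= cutoff * (1 + 1e-12))

# Suppose 70 of 100 raters preferred image A over image B.
print(two_sided_binomial_p(70, 100))  # well below 0.05: a real preference
```

Reporting the test, the counts, and the recruitment procedure — rather than only "users preferred our result" — is the kind of replicable detail the survey argues for.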
Title: An Introduction to Neural Data Compression
Authors: Yibo Yang, S. Mandt, Lucas Theis
Foundations and Trends in Computer Graphics and Vision, pp. 113–200, 2022-02-14. DOI: https://doi.org/10.1561/0600000107

Abstract: Neural compression is the application of neural networks and other machine learning methods to data compression. Recent advances in statistical machine learning have opened up new possibilities for data compression, allowing compression algorithms to be learned end-to-end from data using powerful generative models such as normalizing flows, variational autoencoders, diffusion probabilistic models, and generative adversarial networks. The present article aims to introduce this field of research to a broader machine learning audience by reviewing the necessary background in information theory (e.g., entropy coding, rate-distortion theory) and computer vision (e.g., image quality assessment, perceptual metrics), and by providing a curated guide through the essential ideas and methods in the literature thus far.
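The rate-distortion trade-off at the heart of that background material can be illustrated without any neural network: uniformly quantize a signal, take the empirical entropy of the quantization indices as a proxy for the rate an ideal entropy coder would approach, and measure mean squared error as distortion. A toy sketch, with an arbitrary Gaussian source and arbitrary step sizes:

```python
import numpy as np

def rate_distortion_point(x, step):
    """Quantize x with a uniform quantizer of the given step size and return
    (rate in bits/sample as empirical Shannon entropy of the indices,
     distortion as mean squared reconstruction error)."""
    idx = np.round(x / step).astype(int)
    x_hat = idx * step
    _, counts = np.unique(idx, return_counts=True)
    p = counts / counts.sum()
    rate = float(-np.sum(p * np.log2(p)))
    distortion = float(np.mean((x - x_hat) ** 2))
    return rate, distortion

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
for step in (0.25, 0.5, 1.0, 2.0):
    r, d = rate_distortion_point(x, step)
    print(f"step={step}: rate={r:.2f} bits/sample, MSE={d:.4f}")
```

Sweeping the step size traces out a rate-distortion curve; learned codecs replace the fixed quantizer and entropy model with trained networks but optimize the same two-term objective.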
Title: Deep Learning for Image/Video Restoration and Super-resolution
Authors: A. Tekalp
Foundations and Trends in Computer Graphics and Vision, pp. 1–110, 2022. DOI: https://doi.org/10.1561/0600000100
Title: Deep Learning for Multimedia Forensics
Authors: Irene Amerini, Aris Anagnostopoulos, Luca Maiano, Lorenzo Ricciardi Celsi
Foundations and Trends in Computer Graphics and Vision, Vol. 12, No. 4, pp. 309–457, 2021. DOI: https://doi.org/10.1561/0600000096
Full text available at: http://dx.doi.org/10.1561/0600000096

Abstract: In the last two decades, we have witnessed an immense increase in the use of multimedia content on the internet, for applications ranging from the most innocuous to very critical ones. Naturally, this growth has given rise to many types of threats posed when this content can be manipulated or used for malicious purposes. For example, fake media can be used to sway personal opinions, ruin the image of a public figure, or support criminal activities such as terrorist propaganda and cyberbullying. The research community has of course moved to counterattack these threats by designing manipulation-detection systems based on a variety of techniques, such as signal processing, statistics, and machine learning. This research and practice activity has given rise to the field of multimedia forensics. The success of deep learning in the last decade has led to its use in multimedia forensics as well. In this survey, we look at the latest trends and deep-learning-based techniques introduced to solve three main questions investigated in the field of multimedia forensics. We begin by examining the manipulations of images and videos produced with editing tools, reporting the deep-learning approaches adopted to …
Title: Discrete Graphical Models — An Optimization Perspective
Authors: Bogdan Savchynskyy
Foundations and Trends in Computer Graphics and Vision, pp. 160–429, 2019-12-09. DOI: https://doi.org/10.1561/0600000084

Abstract: This monograph is about discrete energy minimization for discrete graphical models. It considers graphical models, or, more precisely, maximum a posteriori inference for graphical models, purely as a combinatorial optimization problem. Modeling, applications, probabilistic interpretations, and many other aspects are either ignored here or find their place in examples and remarks only. It covers the integer linear programming formulation of the problem as well as its linear programming, Lagrange, and Lagrange decomposition-based relaxations. In particular, it provides a detailed analysis of the polynomially solvable acyclic and submodular problems, along with the corresponding exact optimization methods. Major approximate methods, such as message passing and graph cut techniques, are also described and analyzed comprehensively. The monograph can be useful for undergraduate and graduate students studying optimization or graphical models, as well as for experts in optimization who want to have a look at graphical models. To make the monograph suitable for both categories of readers, we explicitly separate the mathematical optimization background chapters from those specific to graphical models.
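For the polynomially solvable acyclic case the monograph analyzes, exact MAP inference reduces to min-sum dynamic programming. A minimal sketch for a chain with a Potts-style pairwise cost — the energies are toy values chosen for illustration, not an example from the text, and NumPy is assumed:

```python
import numpy as np

def chain_map(unary, pairwise):
    """Exact MAP labeling of a chain model by min-sum dynamic programming.

    unary:    (n, L) array, unary[i, l] = cost of label l at node i.
    pairwise: (L, L) array, pairwise[a, b] = cost of labels (a, b) on an edge.
    Returns the label sequence minimizing the total energy."""
    n, L = unary.shape
    cost = unary[0].copy()                 # best energy ending at node 0
    back = np.zeros((n, L), dtype=int)     # argmin pointers for backtracking
    for i in range(1, n):
        total = cost[:, None] + pairwise   # total[a, b]: come from a, take b
        back[i] = np.argmin(total, axis=0)
        cost = total.min(axis=0) + unary[i]
    labels = np.empty(n, dtype=int)
    labels[-1] = int(np.argmin(cost))
    for i in range(n - 1, 0, -1):          # walk the pointers back to node 0
        labels[i - 1] = back[i, labels[i]]
    return labels

# Four nodes, two labels: the unaries pull the ends toward different labels,
# and a unit Potts cost penalizes each label change along the chain.
unary = np.array([[0, 2], [0, 2], [2, 0], [2, 0]], dtype=float)
pairwise = np.array([[0, 1], [1, 0]], dtype=float)
print(chain_map(unary, pairwise))  # [0 0 1 1]
```

A single label switch (total energy 1) beats both uniform labelings (energy 4 each), so the optimum is [0, 0, 1, 1]; the same forward/backward scheme extends to any tree, whereas cyclic models require the relaxations and approximate methods the monograph develops.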