Pub Date : 2024-02-15 DOI: 10.48550/arXiv.2402.10038
Saeed Khaki, JinJin Li, Lan Ma, Liu Yang, Prathap Ramachandra
Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent. However, proximal policy optimization (PPO) based RLHF is occasionally unstable, requires significant hyperparameter tuning, and is computationally expensive when maximizing the estimated reward during alignment. Recently, direct preference optimization (DPO) has been proposed to address these challenges. However, DPO relies on contrastive responses generated by human annotators and alternative LLMs rather than by the policy model, which limits the effectiveness of RLHF. In this paper, we address both challenges by systematically combining rejection sampling (RS) and DPO. Our proposed method, RS-DPO, begins with the development of a supervised fine-tuned policy model (SFT). A varied set of k responses per prompt is sampled directly from the SFT model. RS-DPO identifies pairs of contrastive samples based on their reward distribution. Finally, we apply DPO with the contrastive samples to align the model to human preference. Our experiments indicate that our proposed method effectively fine-tunes LLMs in resource-limited environments, leading to improved alignment with user intent. Furthermore, it outperforms existing methods, including RS, PPO, and DPO.
{"title":"RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models","authors":"Saeed Khaki, JinJin Li, Lan Ma, Liu Yang, Prathap Ramachandra","doi":"10.48550/arXiv.2402.10038","DOIUrl":"https://doi.org/10.48550/arXiv.2402.10038","journal":{"name":"ArXiv"},"publicationDate":"2024-02-15"}
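The pair-selection and DPO steps described in the RS-DPO abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract only says pairs are identified "based on their reward distribution", so the reward-gap threshold rule, the function names `select_pairs` and `dpo_loss`, and the default `beta` are all assumptions. The loss itself is the standard DPO objective for a single preference pair.

```python
import math
from itertools import combinations


def select_pairs(responses, rewards, gap_threshold):
    """Return (chosen, rejected) pairs whose reward gap exceeds a threshold.

    Assumption: a plain gap threshold stands in for whatever selection rule
    the paper actually applies to the reward distribution.
    """
    pairs = []
    for i, j in combinations(range(len(responses)), 2):
        gap = rewards[i] - rewards[j]
        if abs(gap) > gap_threshold:
            chosen, rejected = (i, j) if gap > 0 else (j, i)
            pairs.append((responses[chosen], responses[rejected]))
    return pairs


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical rewards for k = 3 sampled responses to one prompt.
pairs = select_pairs(["a", "b", "c"], [0.9, 0.1, 0.8], gap_threshold=0.5)
```

With these made-up rewards, "a" and "c" are each paired against "b", while the near-tie between "a" and "c" is discarded.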
Pub Date : 2024-02-15 DOI: 10.48550/arXiv.2402.10053
Tianyi Zhou, Stefan Neumann, Kiran Garimella, A. Gionis
Timeline algorithms are key parts of online social networks, but in recent years they have been blamed for increasing polarization and disagreement in our society. Opinion-dynamics models have been used to study a variety of phenomena in online social networks, but it remains an open question how these models can be augmented to take into account the fine-grained impact of user-level timeline algorithms. We make progress on this question by providing a way to model the impact of timeline algorithms on opinion dynamics. Specifically, we show how the popular Friedkin--Johnsen opinion-formation model can be augmented based on aggregate information extracted from timeline data. We use our model to study the problem of minimizing polarization and disagreement; we assume that we are allowed to make small changes to the users' timeline compositions by strengthening some topics of discussion and penalizing others. We present a gradient descent-based algorithm for this problem, and show that under realistic parameter settings, our algorithm computes a $(1+\varepsilon)$-approximate solution in time $\tilde{O}(m\sqrt{n} \lg(1/\varepsilon))$, where $m$ is the number of edges in the graph and $n$ is the number of vertices. We also present an algorithm that provably computes an $\varepsilon$-approximation of our model in near-linear time. We evaluate our method on real-world data and show that it effectively reduces the polarization and disagreement in the network. Finally, we release an anonymized graph dataset with ground-truth opinions and more than 27,000 nodes (the previously largest publicly available dataset contains fewer than 550 nodes).
{"title":"Modeling the Impact of Timeline Algorithms on Opinion Dynamics Using Low-rank Updates","authors":"Tianyi Zhou, Stefan Neumann, Kiran Garimella, A. Gionis","doi":"10.48550/arXiv.2402.10053","DOIUrl":"https://doi.org/10.48550/arXiv.2402.10053","journal":{"name":"ArXiv"},"publicationDate":"2024-02-15"}
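For orientation, the plain Friedkin--Johnsen model that the paper builds on can be sketched as below. This covers only the baseline model, not the paper's timeline-based augmentation or its low-rank updates. The function names are hypothetical, the weight matrix is assumed symmetric with a zero diagonal, and the polarization and disagreement definitions are the commonly used ones, which may differ in detail from the paper's.

```python
def fj_equilibrium(weights, innate, iters=200):
    """Iterate z_i <- (s_i + sum_j w_ij z_j) / (1 + sum_j w_ij) until it settles.

    weights: symmetric adjacency matrix (list of lists) with zero diagonal.
    innate:  innate opinions s_i.
    The fixed point equals (I + L)^{-1} s, with L the graph Laplacian.
    """
    n = len(innate)
    z = list(innate)
    for _ in range(iters):
        z = [
            (innate[i] + sum(weights[i][j] * z[j] for j in range(n)))
            / (1.0 + sum(weights[i]))
            for i in range(n)
        ]
    return z


def polarization_plus_disagreement(weights, z):
    """Sum of opinion variance (unnormalized) and edge-wise disagreement."""
    n = len(z)
    mean = sum(z) / n
    polarization = sum((zi - mean) ** 2 for zi in z)
    disagreement = sum(
        weights[i][j] * (z[i] - z[j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )
    return polarization + disagreement


# Two users joined by one unit-weight edge, with opposite innate opinions.
w = [[0.0, 1.0], [1.0, 0.0]]
z = fj_equilibrium(w, [0.0, 1.0])
index = polarization_plus_disagreement(w, z)
```

On this two-node example the expressed opinions settle at 1/3 and 2/3, that is, each user is pulled partway toward the neighbour.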
Pub Date : 2024-02-15 DOI: 10.48550/arXiv.2402.09635
Sedat Ozer, A. P. Ndigande
This paper proposes a deep learning-based solution for multi-modal alignment of UAV-taken images. Many recently proposed state-of-the-art alignment techniques rely on Lucas-Kanade (LK) based solutions for a successful alignment. However, we show that we can achieve state-of-the-art results without using LK-based methods. Our approach uses a two-branch convolutional neural network (CNN) built on feature embedding blocks. We propose two variants of our approach: in the first variant (ModelA), we directly predict the new coordinates of only the four corners of the image to be aligned; in the second (ModelB), we predict the homography matrix directly. Applying alignment on the image corners forces the algorithm to match only those four corners, as opposed to computing and matching many (key)points, since the latter may cause many outliers and yield less accurate alignment. We test our proposed approach on four aerial datasets and obtain state-of-the-art results compared to recent deep LK-based architectures.
{"title":"VisIRNet: Deep Image Alignment for UAV-taken Visible and Infrared Image Pairs","authors":"Sedat Ozer, A. P. Ndigande","doi":"10.48550/arXiv.2402.09635","DOIUrl":"https://doi.org/10.48550/arXiv.2402.09635","journal":{"name":"ArXiv"},"publicationDate":"2024-02-15"}
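The two VisIRNet variants are closely related, because four corner correspondences in general position determine a homography. The sketch below shows that link using the textbook direct linear solve with the bottom-right entry fixed to 1. It is not the network or the authors' code, and the helper names `solve_linear` and `homography_from_corners` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def homography_from_corners(src, dst):
    """Recover the 3x3 homography mapping four src corners onto four dst corners."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), and likewise for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


# Unit square shifted by (2, 3): the result should be a pure translation.
H = homography_from_corners(
    [(0, 0), (1, 0), (1, 1), (0, 1)],
    [(2, 3), (3, 3), (3, 4), (2, 4)],
)
```

Fixing the last entry to 1 assumes it is nonzero, which holds for ordinary image warps but not for every projective transform.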
Pub Date : 2024-02-15 DOI: 10.48550/arXiv.2402.09740
Won-Kwang Park
In this study, we consider the application of the orthogonality sampling method (OSM) with single and multiple sources for fast identification of small objects in a limited-aperture inverse scattering problem. We first apply the OSM with a single source and show that the corresponding indicator function can be expressed in terms of the Bessel function of order zero of the first kind, an infinite series of Bessel functions of nonzero integer order of the first kind, the range of the signal receiver, and the location of the emitter. Based on this result, we explain that the objects can be identified through the OSM with a single source, but the identification is significantly influenced by the location of the source and the applied frequency. To improve on this, we then consider the OSM with multiple sources. Based on the identified structure of the OSM with a single source, we design an indicator function of the OSM with multiple sources and show that it can be expressed in terms of the square of the Bessel function of order zero of the first kind and an infinite series of the squares of Bessel functions of nonzero integer order of the first kind. Based on the theoretical results, we explain that the objects can be identified uniquely through the designed OSM. Several numerical experiments with experimental data provided by the Institute Fresnel demonstrate the pros and cons of the OSM with a single source and how the designed OSM with multiple sources behaves.
{"title":"Inversion of limited-aperture Fresnel experimental data using orthogonality sampling method with single and multiple sources","authors":"Won-Kwang Park","doi":"10.48550/arXiv.2402.09740","DOIUrl":"https://doi.org/10.48550/arXiv.2402.09740","journal":{"name":"ArXiv"},"publicationDate":"2024-02-15"}
Naming is very important in software development, as names are often the only vehicle of meaning about what the code is intended to do. A recent study on how developers choose names collected the names given by different developers for the same objects. This enabled a study of these names' diversity and structure, and the construction of a model of how names are created. We reproduce different parts of this study in three independent experiments. Importantly, we employ methodological variations rather than striving for an exact replication. When the same results are obtained, this boosts our confidence in their validity by demonstrating that they do not depend on the methodology. Our results indeed corroborate those of the original study in terms of the diversity of names, the low probability of two developers choosing the same name, and the finding that experienced developers tend to use slightly longer names than inexperienced students. We explain name diversity by performing a new analysis of the names, classifying the concepts represented in them as universal (agreed upon), alternative (reflecting divergent views on a topic), or optional (reflecting divergent opinions on whether to include this concept at all). This classification enables new research directions concerning the considerations involved in naming decisions. We also show that explicitly using the model proposed in the original study to guide naming leads to the creation of better names, whereas the simpler approach of just asking participants to use longer and more detailed names does not.
{"title":"Reproducing, Extending, and Analyzing Naming Experiments","authors":"Rachel Alpern, Ido Lazer, Issar Tzachor, Hanit Hakim, Sapir Weissbuch, D. Feitelson","doi":"10.48550/arXiv.2402.10022","DOIUrl":"https://doi.org/10.48550/arXiv.2402.10022","journal":{"name":"ArXiv"},"publicationDate":"2024-02-15"}
Mary Sánchez-Gordón, Ricardo Colomo Palacios, Alex Sanchez Gordon
A role model is a person who serves as an example for others to follow, especially in terms of values, behavior, achievements, and personal characteristics. In this paper, the authors study how role models influence software practitioners' careers, an aspect not studied in the literature before. Through this study, the authors aim to understand whether there are any salient role model archetypes and what characteristics participants value in their role models. To do so, the authors use a thematic coding approach to analyze the data collected from interviewing ten Latin American software practitioners. Findings reveal that role models were perceived as sources of knowledge, yet the majority of participants, regardless of their career stage, displayed a stronger interest in the human side and the moral values that their role models embodied. This study also shows that any practitioner can be viewed as a role model.
{"title":"Characterizing Role Models in Software Practitioners' Career: An Interview Study","authors":"Mary Sánchez-Gordón, Ricardo Colomo Palacios, Alex Sanchez Gordon","doi":"10.1145/3641822.3641883","DOIUrl":"https://doi.org/10.1145/3641822.3641883","journal":{"name":"ArXiv"},"publicationDate":"2024-02-15"}
Pub Date : 2024-02-15 DOI: 10.48550/arXiv.2402.09637
Eyad Shtaiwi, Ahmed Abdelhadi, Husheng Li, Zhu Han, H. V. Poor
Sixth-generation (6G) wireless communication systems, as stated in the European 6G flagship project Hexa-X, are anticipated to feature the integration of intelligence, communication, sensing, positioning, and computation. An important aspect of this integration is integrated sensing and communication (ISAC), in which the same waveform is used for both sensing and communication to address the challenge of spectrum scarcity. Recently, the orthogonal time frequency space (OTFS) waveform has been proposed to address OFDM's limitations due to the high Doppler spread in some future wireless communication systems. In this paper, we review existing OTFS waveforms for ISAC systems and provide some insights into future research. First, we introduce the basic principles and a system model of OTFS, providing a foundational understanding of the technology's core concepts and architecture. Subsequently, we present an overview of OTFS-based ISAC system frameworks. We provide a comprehensive review of recent research developments and the current state of the art in the field of OTFS-assisted ISAC systems. Furthermore, we perform a thorough comparison between OTFS-enabled ISAC operations and traditional OFDM, highlighting the distinctive advantages of OTFS, especially in high Doppler spread scenarios. We then address the primary challenges facing OTFS-based ISAC systems, identifying potential limitations and drawbacks. Finally, we suggest future research directions, aiming to inspire further innovation in the 6G wireless communication landscape.
{"title":"Orthogonal Time Frequency Space for Integrated Sensing and Communication: A Survey","authors":"Eyad Shtaiwi, Ahmed Abdelhadi, Husheng Li, Zhu Han, H. V. Poor","doi":"10.48550/arXiv.2402.09637","DOIUrl":"https://doi.org/10.48550/arXiv.2402.09637","journal":{"name":"ArXiv"},"publicationDate":"2024-02-15"}
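As background for the OTFS survey entry, the core OTFS operation is often written as an inverse symplectic finite Fourier transform that maps symbols on a delay-Doppler grid to the time-frequency grid. The form below is one common convention from the OTFS literature, not one taken from this abstract. All symbols are assumptions: $x[k,l]$ is the symbol at Doppler index $k$ and delay index $l$, $X[n,m]$ is the sample at time slot $n$ and subcarrier $m$, and $N$ and $M$ are the numbers of time slots and subcarriers. Sign and index conventions vary between papers.

```latex
X[n,m] \;=\; \frac{1}{\sqrt{NM}} \sum_{k=0}^{N-1} \sum_{l=0}^{M-1}
  x[k,l]\, e^{\,j 2\pi \left( \frac{nk}{N} - \frac{ml}{M} \right)},
\qquad n = 0,\dots,N-1, \quad m = 0,\dots,M-1 .
```

Under this convention, each delay-Doppler symbol is spread over the whole time-frequency grid, which is the usual intuition for OTFS's robustness in high Doppler spread scenarios.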
Pub Date : 2024-02-15 DOI: 10.48550/arXiv.2402.09722
Stephen Hausler, David Hall, Sutharsan Mahendren, Peyman Moghadam
Neural fields, coordinate-based neural networks, have recently gained popularity for implicitly representing a scene. In contrast to classical methods that are based on explicit representations such as point clouds, neural fields provide a continuous scene representation able to represent 3D geometry and appearance in a way that is compact and ideal for robotics applications. However, few prior methods have investigated registering multiple neural fields by directly utilising these continuous implicit representations. In this paper, we present Reg-NF, a neural fields-based registration that optimises for the relative 6-DoF transformation between two arbitrary neural fields, even if those two fields have different scale factors. Key components of Reg-NF include a bidirectional registration loss, multi-view surface sampling, and utilisation of volumetric signed distance functions (SDFs). We showcase our approach on a new neural field dataset for evaluating registration problems. We provide an exhaustive set of experiments and ablation studies to identify the performance of our approach, while also discussing limitations to provide future direction to the research community on open challenges in utilising neural fields in unconstrained environments.
{"title":"Reg-NF: Efficient Registration of Implicit Surfaces within Neural Fields","authors":"Stephen Hausler, David Hall, Sutharsan Mahendren, Peyman Moghadam","doi":"10.48550/arXiv.2402.09722","DOIUrl":"https://doi.org/10.48550/arXiv.2402.09722","journal":{"name":"ArXiv"},"publicationDate":"2024-02-15"}
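One way to picture a bidirectional registration loss over signed distance functions is sketched below. It is only an illustration of the idea named in the Reg-NF abstract, not the paper's loss. Analytic spheres stand in for learned neural fields, the transform is modelled as a scale, a rotation matrix, and a translation, and every function name is hypothetical.

```python
import math


def sphere_sdf(center, radius):
    """Signed distance to a sphere; a stand-in for a learned neural SDF."""
    def sdf(p):
        return math.dist(p, center) - radius
    return sdf


def sphere_surface(center, radius, n=64):
    """Deterministic, roughly even samples on a sphere surface."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    pts = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        pts.append((center[0] + radius * r * math.cos(golden * i),
                    center[1] + radius * r * math.sin(golden * i),
                    center[2] + radius * z))
    return pts


def apply(scale, rot, trans, p):
    """Map a point from frame A to frame B: x_b = scale * rot @ x_a + trans."""
    return tuple(scale * sum(rot[r][c] * p[c] for c in range(3)) + trans[r]
                 for r in range(3))


def apply_inverse(scale, rot, trans, p):
    """Map a point from frame B back to frame A."""
    q = [(p[c] - trans[c]) / scale for c in range(3)]
    # rot is orthonormal, so its inverse is its transpose.
    return tuple(sum(rot[c][r] * q[c] for c in range(3)) for r in range(3))


def bidirectional_loss(sdf_a, pts_a, sdf_b, pts_b, scale, rot, trans):
    """Mean |SDF| of A's surface points in B's field, plus the reverse direction."""
    fwd = sum(abs(sdf_b(apply(scale, rot, trans, p))) for p in pts_a) / len(pts_a)
    bwd = sum(abs(sdf_a(apply_inverse(scale, rot, trans, p))) for p in pts_b) / len(pts_b)
    return fwd + bwd


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sdf_a, pts_a = sphere_sdf((0.0, 0.0, 0.0), 1.0), sphere_surface((0.0, 0.0, 0.0), 1.0)
sdf_b, pts_b = sphere_sdf((1.0, 0.0, 0.0), 2.0), sphere_surface((1.0, 0.0, 0.0), 2.0)
loss_true = bidirectional_loss(sdf_a, pts_a, sdf_b, pts_b, 2.0, identity, (1.0, 0.0, 0.0))
loss_bad = bidirectional_loss(sdf_a, pts_a, sdf_b, pts_b, 1.0, identity, (0.0, 0.0, 0.0))
```

The loss vanishes at the true scale and translation and is clearly positive at the identity transform. Because spheres are rotationally symmetric, this toy case does not exercise the rotation component.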
Pub Date : 2024-02-15 DOI: 10.48550/arXiv.2402.09989
Jinyuan Li, Han Li, Di Sun, Jiahao Wang, Wenkun Zhang, Zan Wang, Gang Pan
Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types and their corresponding visual regions. The GMNER task exhibits two challenging properties: 1) The weak correlation between image-text pairs in social media results in a significant portion of named entities being ungroundable. 2) There exists a distinction between coarse-grained referring expressions commonly used in similar tasks (e.g., phrase localization, referring expression comprehension) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as a connecting bridge. This reformulation brings two benefits: 1) It maintains the optimal MNER performance and eliminates the need for employing object detection methods to pre-extract regional features, thereby naturally addressing two major limitations of existing GMNER methods. 2) The introduction of entity expansion expressions and a Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG). It enables RiVEG to effortlessly inherit the Visual Entailment and Visual Grounding capabilities of any current or prospective multimodal pretraining models. Extensive experiments demonstrate that RiVEG outperforms state-of-the-art methods on the existing GMNER dataset and achieves absolute leads of 10.65%, 6.21%, and 8.83% in all three subtasks.
{"title":"LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition","authors":"Jinyuan Li, Han Li, Di Sun, Jiahao Wang, Wenkun Zhang, Zan Wang, Gang Pan","doi":"10.48550/arXiv.2402.09989","DOIUrl":"https://doi.org/10.48550/arXiv.2402.09989","journal":{"name":"ArXiv"},"publicationDate":"2024-02-15"}
Pub Date: 2024-02-15 DOI: 10.48550/arXiv.2402.09984
Ravi Hammond, Dustin Craggs, Mingyu Guo, Jakob Foerster, Ian Reid
In many collaborative settings, artificial intelligence (AI) agents must be able to adapt to new teammates that use unknown or previously unobserved strategies. While often simple for humans, this can be challenging for AI agents. For example, if an AI agent learns to drive alongside others (a training set) that only drive on one side of the road, it may struggle to adapt this experience to coordinate with drivers on the opposite side, even if their behaviours are simply flipped along the left-right symmetry. To address this, we introduce symmetry-breaking augmentations (SBA), which increase diversity in the behaviour of training teammates by applying a symmetry-flipping operation. By learning a best response to the augmented set of teammates, our agent is exposed to a wider range of behavioural conventions, improving performance when deployed with novel teammates. We demonstrate this experimentally in two settings, and show that our approach improves upon previous ad hoc teamwork results in the challenging card game Hanabi. We also propose a general metric for estimating symmetry-dependency amongst a given set of policies.
{"title":"Symmetry-Breaking Augmentations for Ad Hoc Teamwork","authors":"Ravi Hammond, Dustin Craggs, Mingyu Guo, Jakob Foerster, Ian Reid","doi":"10.48550/arXiv.2402.09984","DOIUrl":"https://doi.org/10.48550/arXiv.2402.09984","PeriodicalId":8425,"journal":{"name":"ArXiv"},"PeriodicalIF":0.0,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"139962262","RegionNum":0,"Total":0}
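The symmetry-flipping idea in the abstract above can be illustrated with a toy version of its driving example. This is a minimal sketch, not the authors' implementation: the string-based observations and actions and the names `flip`, `flip_policy`, `augment`, and `keep_left` are all hypothetical, and the paper's actual symmetries (for example in Hanabi) are not reproduced here.

```python
from typing import Callable, List

# Hypothetical toy interface: a teammate policy maps an observation to an action.
Policy = Callable[[str], str]

_FLIP = {"left": "right", "right": "left"}


def flip(symbol: str) -> str:
    """Swap 'left' and 'right'; leave any other symbol unchanged."""
    return _FLIP.get(symbol, symbol)


def flip_policy(policy: Policy) -> Policy:
    """Wrap a teammate so it behaves as its left-right mirror image."""
    def flipped(observation: str) -> str:
        # Mirror the observation, query the original teammate, mirror its action.
        return flip(policy(flip(observation)))
    return flipped


def augment(teammates: List[Policy]) -> List[Policy]:
    """Return the original teammates plus one mirrored copy of each."""
    return teammates + [flip_policy(p) for p in teammates]


def keep_left(observation: str) -> str:
    """Assumed training teammate that always drives on the left."""
    return "left"


pool = augment([keep_left])
lanes = [teammate("clear") for teammate in pool]
print(lanes)
```

A best-response learner trained against `pool` rather than `[keep_left]` would see both driving conventions, which is the diversity effect the abstract describes. Applying `flip_policy` twice recovers the original behaviour, since the toy flip is its own inverse.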