Fine-tuning Large Language Models for Entity Matching
Aaron Steiner, Ralph Peeters, Christian Bizer
Generative large language models (LLMs) are a promising alternative to pre-trained language models for entity matching due to their high zero-shot performance and their ability to generalize to unseen entities. Existing research on using LLMs for entity matching has focused on prompt engineering and in-context learning. This paper explores the potential of fine-tuning LLMs for entity matching. We analyze fine-tuning along two dimensions: 1) the representation of training examples, where we experiment with adding different types of LLM-generated explanations to the training set, and 2) the selection and generation of training examples using LLMs. In addition to the matching performance on the source dataset, we investigate how fine-tuning affects the model's ability to generalize to other in-domain datasets as well as across topical domains. Our experiments show that fine-tuning significantly improves the performance of the smaller models, while the results for the larger models are mixed. Fine-tuning also improves generalization to in-domain datasets while hurting cross-domain transfer. We show that adding structured explanations to the training set has a positive impact on the performance of three out of four LLMs, whereas the proposed example selection and generation methods improve the performance of Llama 3.1 8B but decrease that of GPT-4o Mini.
{"title":"Fine-tuning Large Language Models for Entity Matching","authors":"Aaron Steiner, Ralph Peeters, Christian Bizer","doi":"arxiv-2409.08185","DOIUrl":"https://doi.org/arxiv-2409.08185","url":null,"abstract":"Generative large language models (LLMs) are a promising alternative to\u0000pre-trained language models for entity matching due to their high zero-shot\u0000performance and their ability to generalize to unseen entities. Existing\u0000research on using LLMs for entity matching has focused on prompt engineering\u0000and in-context learning. This paper explores the potential of fine-tuning LLMs\u0000for entity matching. We analyze fine-tuning along two dimensions: 1) The\u0000representation of training examples, where we experiment with adding different\u0000types of LLM-generated explanations to the training set, and 2) the selection\u0000and generation of training examples using LLMs. In addition to the matching\u0000performance on the source dataset, we investigate how fine-tuning affects the\u0000model's ability to generalize to other in-domain datasets as well as across\u0000topical domains. Our experiments show that fine-tuning significantly improves\u0000the performance of the smaller models while the results for the larger models\u0000are mixed. Fine-tuning also improves the generalization to in-domain datasets\u0000while hurting cross-domain transfer. We show that adding structured\u0000explanations to the training set has a positive impact on the performance of\u0000three out of four LLMs, while the proposed example selection and generation\u0000methods only improve the performance of Llama 3.1 8B while decreasing the\u0000performance of GPT-4o Mini.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Top-down Activity Representation Learning for Video Question Answering
Yanan Wang, Shuichiro Haruta, Donghuo Zeng, Julio Vizcarra, Mori Kurokawa
Capturing complex hierarchical human activities, from atomic actions (e.g., picking up one present, moving to the sofa, unwrapping the present) to contextual events (e.g., celebrating Christmas), is crucial for achieving high-performance video question answering (VideoQA). Recent works have expanded multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences, enhancing their temporal reasoning capabilities. However, these approaches often fail to capture contextual events that can be decomposed into multiple atomic actions non-continuously distributed over relatively long-term sequences. In this paper, to leverage the spatial visual-context representation capability of the CLIP model for obtaining non-continuous visual representations of contextual events in videos, we convert long-term video sequences into a spatial image domain and fine-tune the multimodal model LLaVA for the VideoQA task. Our approach achieves competitive performance on the STAR task and, on the NExTQA task, reaches a 78.4% accuracy score, exceeding the current state-of-the-art score by 2.8 points.
{"title":"Top-down Activity Representation Learning for Video Question Answering","authors":"Yanan Wang, Shuichiro Haruta, Donghuo Zeng, Julio Vizcarra, Mori Kurokawa","doi":"arxiv-2409.07748","DOIUrl":"https://doi.org/arxiv-2409.07748","url":null,"abstract":"Capturing complex hierarchical human activities, from atomic actions (e.g.,\u0000picking up one present, moving to the sofa, unwrapping the present) to\u0000contextual events (e.g., celebrating Christmas) is crucial for achieving\u0000high-performance video question answering (VideoQA). Recent works have expanded\u0000multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences,\u0000enhancing the model's temporal reasoning capabilities. However, these\u0000approaches often fail to capture contextual events that can be decomposed into\u0000multiple atomic actions non-continuously distributed over relatively long-term\u0000sequences. In this paper, to leverage the spatial visual context representation\u0000capability of the CLIP model for obtaining non-continuous visual\u0000representations in terms of contextual events in videos, we convert long-term\u0000video sequences into a spatial image domain and finetune the multimodal model\u0000LLaVA for the VideoQA task. Our approach achieves competitive performance on\u0000the STAR task, in particular, with a 78.4% accuracy score, exceeding the\u0000current state-of-the-art score by 2.8 points on the NExTQA task.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Experimenting with Legal AI Solutions: The Case of Question-Answering for Access to Justice
Jonathan Li, Rohan Bhambhoria, Samuel Dahan, Xiaodan Zhu
Generative AI models, such as the GPT and Llama series, have significant potential to assist laypeople in answering legal questions. However, little prior work focuses on the data sourcing, inference, and evaluation of these models in the context of laypersons. To this end, we propose a human-centric legal NLP pipeline, covering data sourcing, inference, and evaluation. We introduce and release a dataset, LegalQA, with real and specific legal questions spanning from employment law to criminal law, corresponding answers written by legal experts, and citations for each answer. We develop an automatic evaluation protocol for this dataset, then show that retrieval-augmented generation from only 850 citations in the train set can match or outperform internet-wide retrieval, despite the citation set containing 9 orders of magnitude less data. Finally, we propose future directions for open-source efforts, which currently fall behind closed-source models.
{"title":"Experimenting with Legal AI Solutions: The Case of Question-Answering for Access to Justice","authors":"Jonathan Li, Rohan Bhambhoria, Samuel Dahan, Xiaodan Zhu","doi":"arxiv-2409.07713","DOIUrl":"https://doi.org/arxiv-2409.07713","url":null,"abstract":"Generative AI models, such as the GPT and Llama series, have significant\u0000potential to assist laypeople in answering legal questions. However, little\u0000prior work focuses on the data sourcing, inference, and evaluation of these\u0000models in the context of laypersons. To this end, we propose a human-centric\u0000legal NLP pipeline, covering data sourcing, inference, and evaluation. We\u0000introduce and release a dataset, LegalQA, with real and specific legal\u0000questions spanning from employment law to criminal law, corresponding answers\u0000written by legal experts, and citations for each answer. We develop an\u0000automatic evaluation protocol for this dataset, then show that\u0000retrieval-augmented generation from only 850 citations in the train set can\u0000match or outperform internet-wide retrieval, despite containing 9 orders of\u0000magnitude less data. Finally, we propose future directions for open-sourced\u0000efforts, which fall behind closed-sourced models.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Controllable Synthetic Clinical Note Generation with Privacy Guarantees
Tal Baumel, Andre Manoel, Daniel Jones, Shize Su, Huseyin Inan, Aaron Bornstein, Robert Sim
In the field of machine learning, domain-specific annotated data is an invaluable resource for training effective models. However, in the medical domain, this data often includes Personal Health Information (PHI), raising significant privacy concerns. The stringent regulations surrounding PHI limit the availability and sharing of medical datasets, which poses a substantial challenge for researchers and practitioners aiming to develop advanced machine learning models. In this paper, we introduce a novel method to "clone" datasets containing PHI. Our approach ensures that the cloned datasets retain the essential characteristics and utility of the original data without compromising patient privacy. By leveraging differential-privacy techniques and a novel fine-tuning task, our method produces datasets that are free from identifiable information while preserving the statistical properties necessary for model training. We conduct utility testing to evaluate the performance of machine learning models trained on the cloned datasets. The results demonstrate that our cloned datasets not only uphold privacy standards but also enhance model performance compared to those trained on traditional anonymized datasets. This work offers a viable solution for the ethical and effective utilization of sensitive medical data in machine learning, facilitating progress in medical research and the development of robust predictive models.
{"title":"Controllable Synthetic Clinical Note Generation with Privacy Guarantees","authors":"Tal BaumelAri, Andre ManoelAri, Daniel JonesAri, Shize SuAri, Huseyin InanAri, AaronAri, Bornstein, Robert Sim","doi":"arxiv-2409.07809","DOIUrl":"https://doi.org/arxiv-2409.07809","url":null,"abstract":"In the field of machine learning, domain-specific annotated data is an\u0000invaluable resource for training effective models. However, in the medical\u0000domain, this data often includes Personal Health Information (PHI), raising\u0000significant privacy concerns. The stringent regulations surrounding PHI limit\u0000the availability and sharing of medical datasets, which poses a substantial\u0000challenge for researchers and practitioners aiming to develop advanced machine\u0000learning models. In this paper, we introduce a novel method to \"clone\" datasets\u0000containing PHI. Our approach ensures that the cloned datasets retain the\u0000essential characteristics and utility of the original data without compromising\u0000patient privacy. By leveraging differential-privacy techniques and a novel\u0000fine-tuning task, our method produces datasets that are free from identifiable\u0000information while preserving the statistical properties necessary for model\u0000training. We conduct utility testing to evaluate the performance of machine\u0000learning models trained on the cloned datasets. The results demonstrate that\u0000our cloned datasets not only uphold privacy standards but also enhance model\u0000performance compared to those trained on traditional anonymized datasets. This\u0000work offers a viable solution for the ethical and effective utilization of\u0000sensitive medical data in machine learning, facilitating progress in medical\u0000research and the development of robust predictive models.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ruri: Japanese General Text Embeddings
Hayato Tsukagoshi, Ryohei Sasano
We report the development of Ruri, a series of Japanese general text embedding models. While the development of general-purpose text embedding models in English and multilingual contexts has been active in recent years, model development in Japanese remains insufficient. The primary reasons for this are the lack of datasets and the absence of necessary expertise. In this report, we provide a detailed account of the development process of Ruri. Specifically, we discuss the training of embedding models using datasets synthesized by LLMs, the construction of a reranker for dataset filtering and knowledge distillation, and the performance evaluation of the resulting general-purpose text embedding models.
{"title":"Ruri: Japanese General Text Embeddings","authors":"Hayato Tsukagoshi, Ryohei Sasano","doi":"arxiv-2409.07737","DOIUrl":"https://doi.org/arxiv-2409.07737","url":null,"abstract":"We report the development of Ruri, a series of Japanese general text\u0000embedding models. While the development of general-purpose text embedding\u0000models in English and multilingual contexts has been active in recent years,\u0000model development in Japanese remains insufficient. The primary reasons for\u0000this are the lack of datasets and the absence of necessary expertise. In this\u0000report, we provide a detailed account of the development process of Ruri.\u0000Specifically, we discuss the training of embedding models using synthesized\u0000datasets generated by LLMs, the construction of the reranker for dataset\u0000filtering and knowledge distillation, and the performance evaluation of the\u0000resulting general-purpose text embedding models.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LLM-POTUS Score: A Framework of Analyzing Presidential Debates with Large Language Models
Zhengliang Liu, Yiwei Li, Oleksandra Zolotarevych, Rongwei Yang, Tianming Liu
Large language models have demonstrated remarkable capabilities in natural language processing, yet their application to political discourse analysis remains underexplored. This paper introduces a novel approach to evaluating presidential debate performances using LLMs, addressing the longstanding challenge of objectively assessing debate outcomes. We propose a framework that analyzes candidates' "Policies, Persona, and Perspective" (3P) and how they resonate with the "Interests, Ideologies, and Identity" (3I) of four key audience groups: voters, businesses, donors, and politicians. Our method employs large language models to generate the LLM-POTUS Score, a quantitative measure of debate performance based on the alignment between 3P and 3I. We apply this framework to analyze transcripts from recent U.S. presidential debates, demonstrating its ability to provide nuanced, multi-dimensional assessments of candidate performances. Our results reveal insights into the effectiveness of different debating strategies and their impact on various audience segments. This study not only offers a new tool for political analysis but also explores the potential and limitations of using LLMs as impartial judges in complex social contexts. In addition, this framework provides individual citizens with an independent tool to evaluate presidential debate performances, which enhances democratic engagement and reduces reliance on potentially biased media interpretations and institutional influence, thereby strengthening the foundation of informed civic participation.
{"title":"LLM-POTUS Score: A Framework of Analyzing Presidential Debates with Large Language Models","authors":"Zhengliang Liu, Yiwei Li, Oleksandra Zolotarevych, Rongwei Yang, Tianming Liu","doi":"arxiv-2409.08147","DOIUrl":"https://doi.org/arxiv-2409.08147","url":null,"abstract":"Large language models have demonstrated remarkable capabilities in natural\u0000language processing, yet their application to political discourse analysis\u0000remains underexplored. This paper introduces a novel approach to evaluating\u0000presidential debate performances using LLMs, addressing the longstanding\u0000challenge of objectively assessing debate outcomes. We propose a framework that\u0000analyzes candidates' \"Policies, Persona, and Perspective\" (3P) and how they\u0000resonate with the \"Interests, Ideologies, and Identity\" (3I) of four key\u0000audience groups: voters, businesses, donors, and politicians. Our method\u0000employs large language models to generate the LLM-POTUS Score, a quantitative\u0000measure of debate performance based on the alignment between 3P and 3I. We\u0000apply this framework to analyze transcripts from recent U.S. presidential\u0000debates, demonstrating its ability to provide nuanced, multi-dimensional\u0000assessments of candidate performances. Our results reveal insights into the\u0000effectiveness of different debating strategies and their impact on various\u0000audience segments. This study not only offers a new tool for political analysis\u0000but also explores the potential and limitations of using LLMs as impartial\u0000judges in complex social contexts. In addition, this framework provides\u0000individual citizens with an independent tool to evaluate presidential debate\u0000performances, which enhances democratic engagement and reduces reliance on\u0000potentially biased media interpretations and institutional influence, thereby\u0000strengthening the foundation of informed civic participation.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated Speaking Assessment of Conversation Tests with Novel Graph-based Modeling on Spoken Response Coherence
Jiun-Ting Li, Bi-Cheng Yan, Tien-Hong Lo, Yi-Cheng Wang, Yung-Chang Hsu, Berlin Chen
Automated speaking assessment in conversation tests (ASAC) aims to evaluate the overall speaking proficiency of an L2 (second-language) speaker in a setting where an interlocutor interacts with one or more candidates. Although prior ASAC approaches have shown promising performance on their respective datasets, there is still a dearth of research specifically focused on incorporating the coherence of the logical flow within a conversation into the grading model. To address this critical challenge, we propose a hierarchical graph model that aptly incorporates both broad inter-response interactions (e.g., discourse relations) and nuanced semantic information (e.g., semantic words and speaker intents), which is subsequently fused with contextual information for the final prediction. Extensive experimental results on the NICT-JLE benchmark dataset suggest that our proposed modeling approach can yield considerable improvements in prediction accuracy with respect to various assessment metrics, as compared to some strong baselines. This also sheds light on the importance of investigating coherence-related facets of spoken responses in ASAC.
{"title":"Automated Speaking Assessment of Conversation Tests with Novel Graph-based Modeling on Spoken Response Coherence","authors":"Jiun-Ting Li, Bi-Cheng Yan, Tien-Hong Lo, Yi-Cheng Wang, Yung-Chang Hsu, Berlin Chen","doi":"arxiv-2409.07064","DOIUrl":"https://doi.org/arxiv-2409.07064","url":null,"abstract":"Automated speaking assessment in conversation tests (ASAC) aims to evaluate\u0000the overall speaking proficiency of an L2 (second-language) speaker in a\u0000setting where an interlocutor interacts with one or more candidates. Although\u0000prior ASAC approaches have shown promising performance on their respective\u0000datasets, there is still a dearth of research specifically focused on\u0000incorporating the coherence of the logical flow within a conversation into the\u0000grading model. To address this critical challenge, we propose a hierarchical\u0000graph model that aptly incorporates both broad inter-response interactions\u0000(e.g., discourse relations) and nuanced semantic information (e.g., semantic\u0000words and speaker intents), which is subsequently fused with contextual\u0000information for the final prediction. Extensive experimental results on the\u0000NICT-JLE benchmark dataset suggest that our proposed modeling approach can\u0000yield considerable improvements in prediction accuracy with respect to various\u0000assessment metrics, as compared to some strong baselines. This also sheds light\u0000on the importance of investigating coherence-related facets of spoken responses\u0000in ASAC.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-Shot Machine-Generated Text Detection Using Mixture of Large Language Models
Matthieu Dubois, François Yvon, Pablo Piantanida
The dissemination of Large Language Models (LLMs), trained at scale and endowed with powerful text-generating abilities, has vastly increased the threats posed by generative AI technologies by reducing the cost of producing harmful, toxic, fake, or forged content. In response, various proposals have been made to automatically discriminate artificially generated from human-written texts, typically framing the problem as a classification task. Most approaches evaluate an input document with a well-chosen detector LLM, assuming that low perplexity scores reliably signal machine-made content. Because relying on a single detector makes performance brittle, we instead consider several detectors and derive a new, theoretically grounded approach to combining their respective strengths. Our experiments, using a variety of generator LLMs, suggest that our method effectively increases the robustness of detection.
{"title":"Zero-Shot Machine-Generated Text Detection Using Mixture of Large Language Models","authors":"Matthieu Dubois, François Yvon, Pablo Piantanida","doi":"arxiv-2409.07615","DOIUrl":"https://doi.org/arxiv-2409.07615","url":null,"abstract":"The dissemination of Large Language Models (LLMs), trained at scale, and\u0000endowed with powerful text-generating abilities has vastly increased the\u0000threats posed by generative AI technologies by reducing the cost of producing\u0000harmful, toxic, faked or forged content. In response, various proposals have\u0000been made to automatically discriminate artificially generated from\u0000human-written texts, typically framing the problem as a classification problem.\u0000Most approaches evaluate an input document by a well-chosen detector LLM,\u0000assuming that low-perplexity scores reliably signal machine-made content. As\u0000using one single detector can induce brittleness of performance, we instead\u0000consider several and derive a new, theoretically grounded approach to combine\u0000their respective strengths. Our experiments, using a variety of generator LLMs,\u0000suggest that our method effectively increases the robustness of detection.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beyond IID: Optimizing Instruction Learning from the Perspective of Instruction Interaction and Dependency
Hanyu Zhao, Li Du, Yiming Ju, Chengwei Wu, Tengfei Pan
With the availability of various instruction datasets, a pivotal challenge is how to effectively select and integrate these instructions to fine-tune large language models (LLMs). Previous research has mainly focused on selecting individual high-quality instructions, overlooking the joint interactions and dependencies between different categories of instructions and thus arriving at suboptimal selection strategies. Moreover, the nature of these interaction patterns remains largely unexplored, and no existing work optimizes the instruction set with regard to them. To fill these gaps, in this paper we: (1) systematically investigate interaction and dependency patterns between different categories of instructions, and (2) optimize the instruction set with respect to these interaction patterns using a linear-programming-based method, and optimize the SFT learning schema using curriculum learning guided by an instruction-dependency taxonomy. Experimental results across different LLMs demonstrate improved performance over strong baselines on widely adopted benchmarks.
{"title":"Beyond IID: Optimizing Instruction Learning from the Perspective of Instruction Interaction and Dependency","authors":"Hanyu Zhao, Li Du, Yiming Ju, Chengwei Wu, Tengfei Pan","doi":"arxiv-2409.07045","DOIUrl":"https://doi.org/arxiv-2409.07045","url":null,"abstract":"With the availability of various instruction datasets, a pivotal challenge is\u0000how to effectively select and integrate these instructions to fine-tune large\u0000language models (LLMs). Previous research mainly focuses on selecting\u0000individual high-quality instructions. However, these works overlooked the joint\u0000interactions and dependencies between different categories of instructions,\u0000leading to suboptimal selection strategies. Moreover, the nature of these\u0000interaction patterns remains largely unexplored, let alone optimize the\u0000instruction set with regard to them. To fill these gaps, in this paper, we: (1)\u0000systemically investigate interaction and dependency patterns between different\u0000categories of instructions, (2) manage to optimize the instruction set\u0000concerning the interaction patterns using a linear programming-based method,\u0000and optimize the learning schema of SFT using an instruction dependency\u0000taxonomy guided curriculum learning. Experimental results across different LLMs\u0000demonstrate improved performance over strong baselines on widely adopted\u0000benchmarks.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"106 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications
Praveen K Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Nada Saadi, Hamza Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan
The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes and between baseline and medically fine-tuned models, with implications for model selection in applications requiring specific strengths, such as low hallucination rates or low inference cost. MEDIC's multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings and helping ensure that the most promising models are identified and adapted for diverse healthcare applications.
{"title":"MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications","authors":"Praveen K Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Nada Saadi, Hamza Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan","doi":"arxiv-2409.07314","DOIUrl":"https://doi.org/arxiv-2409.07314","url":null,"abstract":"The rapid development of Large Language Models (LLMs) for healthcare\u0000applications has spurred calls for holistic evaluation beyond frequently-cited\u0000benchmarks like USMLE, to better reflect real-world performance. While\u0000real-world assessments are valuable indicators of utility, they often lag\u0000behind the pace of LLM evolution, likely rendering findings obsolete upon\u0000deployment. This temporal disconnect necessitates a comprehensive upfront\u0000evaluation that can guide model selection for specific clinical applications.\u0000We introduce MEDIC, a framework assessing LLMs across five critical dimensions\u0000of clinical competence: medical reasoning, ethics and bias, data and language\u0000understanding, in-context learning, and clinical safety. MEDIC features a novel\u0000cross-examination framework quantifying LLM performance across areas like\u0000coverage and hallucination detection, without requiring reference outputs. We\u0000apply MEDIC to evaluate LLMs on medical question-answering, safety,\u0000summarization, note generation, and other tasks. Our results show performance\u0000disparities across model sizes, baseline vs medically finetuned models, and\u0000have implications on model selection for applications requiring specific model\u0000strengths, such as low hallucination or lower cost of inference. MEDIC's\u0000multifaceted evaluation reveals these performance trade-offs, bridging the gap\u0000between theoretical capabilities and practical implementation in healthcare\u0000settings, ensuring that the most promising models are identified and adapted\u0000for diverse healthcare applications.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}