Machine Learning: latest articles, page 4

XAI-TRIS: non-linear image benchmarks to quantify false positive post-hoc attribution of feature importance

IF 7.5 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Learning

Pub Date : 2024-07-16 DOI: 10.1007/s10994-024-06574-3

Benedict Clark, Rick Wilming, Stefan Haufe

The field of ‘explainable’ artificial intelligence (XAI) has produced highly acclaimed methods that seek to make the decisions of complex machine learning (ML) methods ‘understandable’ to humans, for example by attributing ‘importance’ scores to input features. Yet, a lack of formal underpinning leaves it unclear as to what conclusions can safely be drawn from the results of a given XAI method and has also so far hindered the theoretical verification and empirical validation of XAI methods. This means that challenging non-linear problems, typically solved by deep neural networks, presently lack appropriate remedies. Here, we craft benchmark datasets for one linear and three different non-linear classification scenarios, in which the important class-conditional features are known by design, serving as ground truth explanations. Using novel quantitative metrics, we benchmark the explanation performance of a wide set of XAI methods across three deep learning model architectures. We show that popular XAI methods are often unable to significantly outperform random performance baselines and edge detection methods, attributing false-positive importance to features with no statistical relationship to the prediction target rather than truly important features. Moreover, we demonstrate that explanations derived from different model architectures can be vastly different; thus, prone to misinterpretation even under controlled conditions.

Citations: 0

Partitioned least squares

IF 7.5 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Learning

Pub Date : 2024-07-15 DOI: 10.1007/s10994-024-06582-3

Roberto Esposito, Mattia Cerrato, Marco Locatelli

Linear least squares is one of the most widely used regression methods in many fields. The simplicity of the model allows this method to be used when data is scarce and allows practitioners to gather some insight into the problem by inspecting the values of the learnt parameters. In this paper we propose a variant of the linear least squares model allowing practitioners to partition the input features into groups of variables that they require to contribute similarly to the final result. We show that the new formulation is not convex and provide two alternative methods to deal with the problem: one non-exact method based on an alternating least squares approach; and one exact method based on a reformulation of the problem. We show the correctness of the exact method and compare the two solutions showing that the exact solution provides better results in a fraction of the time required by the alternating least squares solution (when the number of partitions is small). We also provide a branch and bound algorithm that can be used in place of the exact method when the number of partitions is too large as well as a proof of NP-completeness of the optimization problem.
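One way to write such a model is sketched below. The symbols (feature groups \(P_k\), per-feature weights \(\alpha_i\), per-group coefficients \(\beta_k\), intercept \(t\)) and the normalization constraints are assumptions chosen for illustration; they are not notation quoted from the abstract.

```latex
\min_{\alpha,\;\beta,\;t}\;
  \sum_{n=1}^{N}\Bigl(y_n - t - \sum_{k=1}^{K}\beta_k \sum_{i \in P_k} \alpha_i\, x_{n,i}\Bigr)^{2}
\quad \text{s.t.} \quad
  \alpha_i \ge 0, \qquad \sum_{i \in P_k} \alpha_i = 1 \;\; (k = 1,\dots,K)
```

Under this assumed form, the product \(\beta_k \alpha_i\) makes the objective bilinear in the unknowns, which is consistent with the non-convexity the abstract reports. With \(\alpha\) fixed the problem reduces to ordinary least squares in \((\beta, t)\), and with \(\beta\) fixed it is a constrained least squares problem in \(\alpha\); an alternating scheme would switch between these two steps.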

Citations: 0

L2XGNN: learning to explain graph neural networks

IF 7.5 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Learning

Pub Date : 2024-07-12 DOI: 10.1007/s10994-024-06576-1

Giuseppe Serra, Mathias Niepert

Graph Neural Networks (GNNs) are a popular class of machine learning models. Inspired by the learning to explain (L2X) paradigm, we propose L2xGnn, a framework for explainable GNNs which provides faithful explanations by design. L2xGnn learns a mechanism for selecting explanatory subgraphs (motifs) which are exclusively used in the GNN's message-passing operations. L2xGnn is able to select, for each input graph, a subgraph with specific properties such as being sparse and connected. Imposing such constraints on the motifs often leads to more interpretable and effective explanations. Experiments on several datasets suggest that L2xGnn achieves the same classification accuracy as baseline methods using the entire input graph while ensuring that only the provided explanations are used to make predictions. Moreover, we show that L2xGnn is able to identify motifs responsible for the graph’s properties it is intended to predict.

Citations: 0

Compressed sensing: a discrete optimization approach

IF 7.5 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Learning

Pub Date : 2024-07-11 DOI: 10.1007/s10994-024-06577-0

Dimitris Bertsimas, Nicholas A. G. Johnson

We study the Compressed Sensing (CS) problem, which is the problem of finding the most sparse vector that satisfies a set of linear measurements up to some numerical tolerance. CS is a central problem in Statistics, Operations Research and Machine Learning which arises in applications such as signal processing, data compression, image reconstruction, and multi-label learning. We introduce an \(\ell_2\) regularized formulation of CS which we reformulate as a mixed integer second order cone program. We derive a second order cone relaxation of this problem and show that under mild conditions on the regularization parameter, the resulting relaxation is equivalent to the well studied basis pursuit denoising problem. We present a semidefinite relaxation that strengthens the second order cone relaxation and develop a custom branch-and-bound algorithm that leverages our second order cone relaxation to solve small-scale instances of CS to certifiable optimality. When compared against solutions produced by three state of the art benchmark methods on synthetic data, our numerical results show that our approach produces solutions that are on average 6.22% more sparse. When compared only against the experiment-wise best performing benchmark method on synthetic data, our approach produces solutions that are on average 3.10% more sparse. On real world ECG data, for a given \(\ell_2\) reconstruction error our approach produces solutions that are on average 9.95% more sparse than benchmark methods (3.88% more sparse if only compared against the best performing benchmark), while for a given sparsity level our approach produces solutions that have on average 10.77% lower reconstruction error than benchmark methods (1.42% lower error if only compared against the best performing benchmark). When used as a component of a multi-label classification algorithm, our approach achieves greater classification accuracy than benchmark compressed sensing methods. This improved accuracy comes at the cost of an increase in computation time by several orders of magnitude. Thus, for applications where runtime is not of critical importance, leveraging integer optimization can yield sparser and lower error solutions to CS than existing benchmarks.
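For orientation, the problems named in this abstract can be written roughly as follows. The first and third are the standard textbook forms; the second is only an assumed shape for an \(\ell_2\)-regularized variant, with the weight \(1/\gamma\) as a placeholder rather than the authors' stated formulation. The symbols \(A\), \(b\), \(\epsilon\) and \(\gamma\) are all illustrative.

```latex
% Sparsest vector consistent with the measurements up to tolerance \epsilon
\min_{x \in \mathbb{R}^n} \; \|x\|_0
  \quad \text{s.t.} \quad \|Ax - b\|_2 \le \epsilon

% ASSUMED shape of an \ell_2-regularized variant (1/\gamma is a placeholder weight)
\min_{x \in \mathbb{R}^n} \; \|x\|_0 + \tfrac{1}{\gamma}\,\|x\|_2^2
  \quad \text{s.t.} \quad \|Ax - b\|_2 \le \epsilon

% Basis pursuit denoising: the convex surrogate with \ell_1 in place of \ell_0
\min_{x \in \mathbb{R}^n} \; \|x\|_1
  \quad \text{s.t.} \quad \|Ax - b\|_2 \le \epsilon
```

The \(\ell_0\) term counts nonzero entries and is what makes the first two problems combinatorial, which is why a mixed integer reformulation and branch-and-bound enter the picture.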

Citations: 0

Explainable dating of Greek papyri images

IF 7.5 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Learning

Pub Date : 2024-07-11 DOI: 10.1007/s10994-024-06589-w

John Pavlopoulos, Maria Konstantinidou, Elpida Perdiki, Isabelle Marthot-Santaniello, Holger Essler, Georgios Vardakas, Aristidis Likas

Greek literary papyri, which are unique witnesses of antique literature, do not usually bear a date. They are thus currently dated based on palaeographical methods, with broad approximations which often span more than a century. We created a dataset of 242 images of papyri written in “bookhand” scripts whose date can be securely assigned, and we used it to train algorithms for the task of dating, showing its challenging nature. To address data scarcity, we extended our dataset by segmenting each image into its respective text lines. By using the line-based version of our dataset, we trained a Convolutional Neural Network, equipped with a fragmentation-based augmentation strategy, and we achieved a mean absolute error of 54 years. The results improve further when the task is cast as a multi-class classification problem, predicting the century. Using our network, we computed precise date estimations for papyri whose date is disputed or vaguely defined, employing explainability to understand dating-driving features.

Citations: 0

Moreau-Yoshida variational transport: a general framework for solving regularized distributional optimization problems

IF 7.5 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Learning

Pub Date : 2024-07-10 DOI: 10.1007/s10994-024-06586-z

Dai Hai Nguyen, Tetsuya Sakurai

We address a general optimization problem involving the minimization of a composite objective functional defined over a class of probability distributions. The objective function consists of two components: one assumed to have a variational representation, and the other expressed in terms of the expectation operator of a possibly nonsmooth convex regularizer function. Such a regularized distributional optimization problem widely appears in machine learning and statistics, including proximal Monte-Carlo sampling, Bayesian inference, and generative modeling for regularized estimation and generation. Our proposed method, named Moreau-Yoshida Variational Transport (MYVT), introduces a novel approach to tackle this regularized distributional optimization problem. First, as the name suggests, our method utilizes the Moreau-Yoshida envelope to provide a smooth approximation of the nonsmooth function in the objective. Second, we reformulate the approximate problem as a concave-convex saddle point problem by leveraging the variational representation. Subsequently, we develop an efficient primal–dual algorithm to approximate the saddle point. Furthermore, we provide theoretical analyses and present experimental results to showcase the effectiveness of the proposed method.
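To make the smoothing step concrete, here is a small sketch of the Moreau-Yoshida envelope on the simplest nonsmooth convex function, f(v) = |v|. It illustrates the envelope only, not the MYVT algorithm; the function names and the grid-search check are hypothetical helpers introduced for this example.

```python
def moreau_envelope_abs(x, lam):
    """Closed-form Moreau-Yoshida envelope of f(v) = |v| with parameter lam > 0.

    The envelope is min over v of |v| + (v - x)^2 / (2 * lam), which for the
    absolute value works out to the Huber function.
    """
    if abs(x) <= lam:
        return x * x / (2.0 * lam)      # quadratic (smooth) region around the kink
    return abs(x) - lam / 2.0           # shifted absolute value away from the kink


def moreau_envelope_numeric(f, x, lam, lo=-10.0, hi=10.0, steps=20001):
    """Brute-force check: minimize f(v) + (v - x)^2 / (2 * lam) over a grid of v."""
    best = float("inf")
    for k in range(steps):
        v = lo + (hi - lo) * k / (steps - 1)
        best = min(best, f(v) + (v - x) ** 2 / (2.0 * lam))
    return best
```

Comparing `moreau_envelope_abs(x, lam)` with `moreau_envelope_numeric(abs, x, lam)` at a few points shows the two agree up to grid resolution, and that the envelope is differentiable at 0 where |v| is not.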

Citations: 0

Permutation-invariant linear classifiers

IF 7.5 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Learning

Pub Date : 2024-07-09 DOI: 10.1007/s10994-024-06561-8

Ludwig Lausser, Robin Szekely, Hans A. Kestler

Invariant concept classes form the backbone of classification algorithms immune to specific data transformations, ensuring consistent predictions regardless of these alterations. However, this robustness can come at the cost of limited access to the original sample information, potentially impacting generalization performance. This study introduces an addition to these classes—the permutation-invariant linear classifiers. Distinguished by their structural characteristics, permutation-invariant linear classifiers are unaffected by permutations on feature vectors, a property not guaranteed by other non-constant linear classifiers. The study characterizes this new concept class, highlighting its constant capacity, independent of input dimensionality. In practical assessments using linear support vector machines, the permutation-invariant classifiers exhibit superior performance in permutation experiments on artificial datasets and real mutation profiles. Interestingly, they outperform general linear classifiers not only in permutation experiments but also in permutation-free settings, surpassing unconstrained counterparts. Additionally, findings from real mutation profiles support the significance of tumor mutational burden as a biomarker.
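A quick way to see the invariance property is to compare a linear classifier whose weights are all equal (so the score depends only on the sum of the features) with one whose weights differ. The sketch below is illustrative only: the equal-weight family is one simple example that has the property, and whether it matches the paper's exact characterization is an assumption. All names and sample values are hypothetical.

```python
from itertools import permutations


def linear_classify(w, b, x):
    """Sign of the linear score w . x + b, returned as +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def is_permutation_invariant(w, b, samples):
    """Check that every reordering of each sample gets the same label as the sample."""
    return all(
        linear_classify(w, b, p) == linear_classify(w, b, x)
        for x in samples
        for p in permutations(x)
    )


# Hypothetical toy data and weight vectors
samples = [(3.0, 0.0, 1.0), (0.0, 0.5, 0.0), (2.0, 2.0, -1.0)]
equal_w = (1.0, 1.0, 1.0)      # all weights equal: score is sum(x) + b
general_w = (2.0, -1.0, 0.5)   # unequal weights: feature order matters
```

Calling `is_permutation_invariant(equal_w, -2.5, samples)` and then the same with `general_w` contrasts the two cases on this toy data.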

Citations: 0

Regional bias in monolingual English language models

IF 7.5 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Learning

Pub Date : 2024-07-09 DOI: 10.1007/s10994-024-06555-6

Jiachen Lyu, Katharina Dost, Yun Sing Koh, Jörg Wicker

In Natural Language Processing (NLP), pre-trained language models (LLMs) are widely employed and refined for various tasks. These models have shown considerable social and geographic biases creating skewed or even unfair representations of certain groups. Research focuses on biases toward L2 (English as a second language) regions but neglects bias within L1 (first language) regions. In this work, we ask if there is regional bias within L1 regions already inherent in pre-trained LLMs and, if so, what the consequences are in terms of downstream model performance. We contribute an investigation framework specifically tailored for low-resource regions, offering a method to identify bias without imposing strict requirements for labeled datasets. Our research reveals subtle geographic variations in the word embeddings of BERT, even in cultures traditionally perceived as similar. These nuanced features, once captured, have the potential to significantly impact downstream tasks. Generally, models exhibit comparable performance on datasets that share similarities, and conversely, performance may diverge when datasets differ in their nuanced features embedded within the language. It is crucial to note that estimating model performance solely based on standard benchmark datasets may not necessarily apply to the datasets with distinct features from the benchmark datasets. Our proposed framework plays a pivotal role in identifying and addressing biases detected in word embeddings, particularly evident in low-resource regions such as New Zealand.

Citations: 0

Conformal predictions for probabilistically robust scalable machine learning classification

IF 7.5 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Learning

Pub Date : 2024-07-09 DOI: 10.1007/s10994-024-06571-6

Alberto Carlevaro, Teodoro Alamo, Fabrizio Dabbene, Maurizio Mongelli

Conformal predictions make it possible to define reliable and robust learning algorithms. But they are essentially a method for evaluating whether an algorithm is good enough to be used in practice. To define a reliable learning framework for classification from the very beginning of its design, the concept of scalable classifier was introduced to generalize the concept of classical classifier by linking it to statistical order theory and probabilistic learning theory. In this paper, we analyze the similarities between scalable classifiers and conformal predictions by introducing a new definition of a score function and defining a special set of input variables, the conformal safety set, which can identify patterns in the input space that satisfy the error coverage guarantee, i.e., that the probability of observing the wrong (possibly unsafe) label for points belonging to this set is bounded by a predefined \(\varepsilon\) error level. We demonstrate the practical implications of this framework through an application in cybersecurity for identifying DNS tunneling attacks. Our work contributes to the development of probabilistically robust and reliable machine learning models.
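For readers unfamiliar with the coverage guarantee mentioned here, the following is a minimal sketch of generic split conformal prediction for classification. It is not the paper's score function or its conformal safety set: the nonconformity score (one minus the model's probability for a label), the function names, and the label names in the usage note are assumptions made for illustration.

```python
import math


def conformal_threshold(cal_scores, epsilon):
    """Threshold from calibration scores for a target error level epsilon.

    cal_scores holds one nonconformity score per calibration point, computed
    for that point's true label. The threshold is the k-th smallest score with
    k = ceil((n + 1) * (1 - epsilon)).
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1.0 - epsilon))
    if k > n:
        # Too few calibration points for this epsilon: every label is kept
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """Labels whose assumed score, 1 - predicted probability, is within the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= threshold}
```

In use, one would compute `conformal_threshold` once on held-out calibration data and then call `prediction_set` on the class probabilities of each new input, for example a dictionary with hypothetical labels such as `"benign"` and `"tunnel"`. Under exchangeability of calibration and test points, the true label falls outside the returned set with probability at most `epsilon`.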

Citations: 0

Neural discovery of balance-aware polarized communities

IF 7.5 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Learning

Pub Date : 2024-07-09 DOI: 10.1007/s10994-024-06581-4

Francesco Gullo, Domenico Mandaglio, Andrea Tagarelli

Signed graphs model friendly (positive) or antagonistic (negative) interactions (edges) among users (nodes). 2-Polarized-Communities (2pc) is a well-established combinatorial-optimization problem whose goal is to find two polarized communities in a signed graph, i.e., two subsets of nodes (disjoint, but not necessarily covering the entire node set) that exhibit a high number of both positive intra-community edges and negative inter-community edges. The state of the art in 2pc suffers from two limitations: (i) existing methods rely on a single (optimal) solution to a continuous relaxation of the problem in order to produce the final discrete solution via rounding, and (ii) the 2pc objective function offers no control over size balance among communities. In this paper, we advance the 2pc problem by addressing both limitations, with a twofold contribution. First, we devise a novel neural approach that soundly and elegantly explores a variety of suboptimal solutions to the relaxed 2pc problem, so as to pick the one that leads to the best discrete solution after rounding. Second, we introduce a generalization of the 2pc objective function, termed $\gamma$-polarity, which fosters size balance among communities, and we incorporate it into the proposed machine-learning framework. Extensive experiments attest to the high accuracy of our approach, its superiority over the state of the art, and the capability of $\gamma$-polarity to discover high-quality size-balanced communities.
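The relax-then-round pipeline mentioned in the abstract can be illustrated with a minimal sketch. It assumes the polarity objective commonly used for 2pc in earlier work, x^T A x / x^T x with x in {-1, 0, +1}^n and A the signed adjacency matrix; the abstract itself does not state the formula. The code performs plain spectral rounding of a single relaxed solution, i.e. the kind of baseline the abstract describes as a limitation, not the authors' neural method, and $\gamma$-polarity is left out because its definition is not given here. The toy graph and the function names are hypothetical.

```python
import numpy as np

def polarity(A, x):
    """Assumed 2pc objective: x^T A x / x^T x for x in {-1, 0, +1}^n."""
    size = np.count_nonzero(x)  # equals x^T x for entries in {-1, 0, +1}
    return float(x @ A @ x) / size if size else 0.0

def round_leading_eigenvector(A):
    """Relax x to real values, take the leading eigenvector, then round it."""
    _, vecs = np.linalg.eigh(A)  # eigenvalues are returned in ascending order
    v = vecs[:, -1]              # eigenvector of the largest eigenvalue
    best_x, best_score = np.zeros(len(v), dtype=int), 0.0
    # Try every threshold t: nodes with |v_i| >= t join a community by sign,
    # all other nodes stay neutral (0).
    for t in np.unique(np.abs(v)):
        x = np.where(np.abs(v) >= t, np.sign(v), 0).astype(int)
        score = polarity(A, x)
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score

# Toy signed graph (hypothetical): nodes 0-2 and 3-4 form two opposing
# groups, node 5 has no edges and should remain neutral.
edges = [(0, 1, 1), (0, 2, 1), (1, 2, 1), (3, 4, 1),
         (0, 3, -1), (1, 4, -1), (2, 3, -1)]
A = np.zeros((6, 6), dtype=int)
for i, j, sign in edges:
    A[i, j] = A[j, i] = sign

x, score = round_leading_eigenvector(A)
print(x, score)  # community labels are defined only up to a global sign flip
```

Keeping only the best threshold of one eigenvector is what makes this a single-solution method; exploring several relaxed solutions before rounding, as the abstract proposes, would replace the single `v` with a set of candidates.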

Citations: 0