Weighting non-IID batches for out-of-distribution detection
Pub Date: 2024-08-19 | DOI: 10.1007/s10994-024-06605-z
Zhilin Zhao, Longbing Cao
A standard network pretrained on in-distribution (ID) samples can make high-confidence predictions on out-of-distribution (OOD) samples, and may therefore fail to distinguish ID from OOD samples in the test phase. To address this over-confidence issue, existing methods improve OOD sensitivity from the modeling perspective, i.e., they retrain the network with modified training processes or objective functions. In contrast, this paper proposes a simple but effective method, Weighted Non-IID Batching (WNB), which works by adjusting batch weights. WNB builds on a key observation: increasing the batch size can improve OOD detection performance. This is because, with a smaller batch size, the samples in a batch are more likely to be treated as non-IID with respect to the assumed ID, i.e., as if they were associated with an OOD, which causes the network to provide high-confidence predictions for samples from that OOD. Accordingly, WNB applies a weight function that weights each batch according to the discrepancy between the batch samples and the entire ID training dataset. Specifically, the weight function is derived by minimizing a generalization error bound; it assigns larger weights to batches with smaller discrepancies and trades off ID classification against OOD detection performance. Experimental results show that incorporating WNB into state-of-the-art OOD detection methods can further improve their performance.
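To make the batch-weighting idea concrete, here is a minimal sketch in PyTorch. The discrepancy measure (distance between the batch mean and the full-dataset mean in input space) and the exponential weight function are placeholder choices for illustration, not the bound-derived weight function of WNB.

```python
# Minimal sketch of weighting each batch's loss by how "ID-like" the batch looks.
# The discrepancy and weight function below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(1024, 20)                 # synthetic ID training features
y = (X[:, 0] > 0).long()                  # synthetic binary labels
dataset_mean = X.mean(dim=0)              # statistic of the full ID training set

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def batch_weight(xb, tau=1.0):
    """Assign larger weights to batches closer to the full ID dataset."""
    discrepancy = torch.norm(xb.mean(dim=0) - dataset_mean)
    return torch.exp(-discrepancy / tau)  # smaller discrepancy -> larger weight

for epoch in range(5):
    perm = torch.randperm(X.size(0))
    for i in range(0, X.size(0), 64):
        idx = perm[i:i + 64]
        xb, yb = X[idx], y[idx]
        loss = batch_weight(xb) * F.cross_entropy(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()
```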
Understanding prediction discrepancies in classification
Pub Date: 2024-08-07 | DOI: 10.1007/s10994-024-06557-4
Xavier Renard, Thibault Laugel, Marcin Detyniecki
A multitude of classifiers can be trained on the same data to achieve similar test-time performance while having learned significantly different classification patterns. When selecting a classifier, the machine learning practitioner has no insight into the differences between models, their limits, where they agree and where they don’t. Yet this choice has concrete consequences for instances that fall in the discrepancy zone, since the final decision is based on the selected classification pattern. Besides the arbitrary nature of the result, a bad choice can have further negative consequences such as loss of opportunity or lack of fairness. This paper addresses the question by analyzing the prediction discrepancies in a pool of best-performing models trained on the same data. A model-agnostic algorithm, DIG, is proposed to capture and explain discrepancies locally in tabular datasets, enabling the practitioner to make the best-educated decision when selecting a model by anticipating its potential undesired consequences.
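As a rough illustration of the underlying notion, the sketch below trains a small pool of similarly accurate classifiers and flags the test instances on which they disagree (the discrepancy zone). It is not the DIG algorithm, which additionally explains such discrepancies locally.

```python
# Identify the "discrepancy zone": test instances on which a pool of
# similarly performing classifiers disagree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pool = [
    RandomForestClassifier(random_state=0).fit(X_tr, y_tr),
    GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr),
    LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
]

preds = np.column_stack([m.predict(X_te) for m in pool])
accuracies = [(p == y_te).mean() for p in preds.T]
discrepancy_zone = np.any(preds != preds[:, [0]], axis=1)   # models disagree

print("test accuracies:", np.round(accuracies, 3))
print("fraction of test instances in the discrepancy zone:",
      discrepancy_zone.mean().round(3))
```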
Empirical Bayes linked matrix decomposition
Pub Date: 2024-08-07 | DOI: 10.1007/s10994-024-06599-8
Eric F. Lock
Data for several applications in diverse fields can be represented as multiple matrices that are linked across rows or columns. This is particularly common in molecular biomedical research, in which multiple molecular “omics” technologies may capture different feature sets (e.g., corresponding to rows in a matrix) and/or different sample populations (corresponding to columns). This has motivated a large body of work on integrative matrix factorization approaches that identify and decompose low-dimensional signal that is shared across multiple matrices or specific to a given matrix. We propose an empirical variational Bayesian approach to this problem that has several advantages over existing techniques, including the flexibility to accommodate shared signal over any number of row or column sets (i.e., bidimensional integration), an intuitive model-based objective function that yields appropriate shrinkage for the inferred signals, and a relatively efficient estimation algorithm with no tuning parameters. A general result establishes conditions for the uniqueness of the underlying decomposition for a broad family of methods that includes the proposed approach. For scenarios with missing data, we describe an associated iterative imputation approach that is novel for the single-matrix context and a powerful approach for “blockwise” imputation (in which an entire row or column is missing) in various linked matrix contexts. Extensive simulations show that the method performs very well under different scenarios with respect to recovering underlying low-rank signal, accurately decomposing shared and specific signals, and accurately imputing missing data. The approach is applied to gene expression and miRNA data from breast cancer tissue and normal breast tissue, for which it gives an informative decomposition of variation and outperforms alternative strategies for missing data imputation.
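The toy simulation below illustrates the kind of data structure targeted: two matrices linked across a shared row set, containing a low-rank signal common to both plus matrix-specific signal and noise. A plain SVD of the column-wise concatenation is used only to show that the shared row subspace is roughly recoverable; the paper's empirical variational Bayes estimator instead decomposes shared and specific signal jointly with model-based shrinkage.

```python
# Toy simulation of linked matrices sharing a row set, with shared and
# matrix-specific low-rank signal; not the paper's estimator.
import numpy as np

rng = np.random.default_rng(0)
n, m1, m2, r = 100, 60, 40, 2                     # shared rows, two column sets

U = rng.normal(size=(n, r))                       # shared row factors
joint1 = U @ rng.normal(size=(r, m1))             # shared signal in matrix 1
joint2 = U @ rng.normal(size=(r, m2))             # shared signal in matrix 2
indiv1 = 0.5 * rng.normal(size=(n, r)) @ rng.normal(size=(r, m1))  # specific to 1
X1 = joint1 + indiv1 + 0.1 * rng.normal(size=(n, m1))
X2 = joint2 + 0.1 * rng.normal(size=(n, m2))

# Rank-r SVD of the concatenation is dominated by the shared signal here.
Uc, s, Vt = np.linalg.svd(np.concatenate([X1, X2], axis=1), full_matrices=False)
shared_estimate = Uc[:, :r]

# Cosines of principal angles between true and estimated shared subspaces
# (values near 1 indicate good recovery).
Q_true = np.linalg.qr(U)[0]
Q_est = np.linalg.qr(shared_estimate)[0]
_, cosines, _ = np.linalg.svd(Q_true.T @ Q_est)
print("cosines of principal angles:", np.round(cosines, 3))
```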
Learning an adaptive forwarding strategy for mobile wireless networks: resource usage vs. latency
Pub Date: 2024-08-07 | DOI: 10.1007/s10994-024-06601-3
Victoria Manfredi, Alicia P. Wolfe, Xiaolan Zhang, Bing Wang
Mobile wireless networks present several challenges for any learning system, due to uncertain and variable device movement, a decentralized network architecture, and constraints on network resources. In this work, we use deep reinforcement learning (DRL) to learn a scalable and generalizable forwarding strategy for such networks. We make the following contributions: (i) we use hierarchical RL to design DRL packet agents rather than device agents, capturing the packet forwarding decisions that are made over time and improving training efficiency; (ii) we use relational features to ensure that the learned forwarding strategy generalizes to a wide range of network dynamics and to enable offline training; and (iii) we incorporate both forwarding goals and network resource considerations into packet decision-making by designing a weighted reward function. Our results show that the forwarding strategy used by our DRL packet agent often achieves a per-packet delivery delay similar to that of the oracle forwarding strategy and almost always outperforms all other strategies (including state-of-the-art strategies) in terms of delay, even in scenarios on which the DRL agent was not trained.
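A hypothetical sketch of such a weighted reward is given below; the specific terms (delivery bonus, delay, transmission count, queue occupancy) and weights are illustrative choices, not the exact reward function used in the paper.

```python
# Illustrative weighted packet-level reward trading off the forwarding goal
# against network resource usage; terms and weights are assumptions.
def packet_reward(delivered: bool, delay: float, transmissions: int,
                  queue_occupancy: float, w_delay: float = 1.0,
                  w_tx: float = 0.1, w_queue: float = 0.05) -> float:
    goal = 10.0 if delivered else 0.0                 # forwarding goal term
    cost = w_delay * delay + w_tx * transmissions + w_queue * queue_occupancy
    return goal - cost

# Example: a packet delivered after 4 time steps, 3 transmissions, light queues.
print(packet_reward(delivered=True, delay=4.0, transmissions=3, queue_occupancy=0.2))
```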
A generic approach for reproducible model distillation
Pub Date: 2024-08-05 | DOI: 10.1007/s10994-024-06597-w
Yunzhe Zhou, Peiru Xu, Giles Hooker
Model distillation has been a popular method for producing interpretable machine learning. It uses an interpretable “student” model to mimic the predictions made by a black-box “teacher” model. However, when the student model is sensitive to the variability of the data sets used for training, even with the teacher fixed, the corresponding interpretation is not reliable. Existing strategies stabilize model distillation by checking whether a large enough sample of pseudo-data is generated to reliably reproduce student models, but methods to do so have so far been developed separately for each specific class of student model. In this paper, we develop a generic approach for stable model distillation based on a central limit theorem for the estimated fidelity of the student to the teacher. We start with a collection of candidate student models and search for candidates that reasonably agree with the teacher. We then construct a multiple-testing framework to select a sample size such that a consistent student model is selected across different pseudo-samples. We demonstrate the application of the proposed approach on three commonly used intelligible models: decision trees, falling rule lists and symbolic regression. Finally, we conduct simulation experiments on the Mammographic Mass and Breast Cancer datasets and complement the testing procedure with a theoretical analysis based on a Markov process. The code is publicly available at https://github.com/yunzhe-zhou/GenericDistillation.
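The sketch below shows the basic distillation setup (a decision-tree student fit to teacher predictions on pseudo-data) together with a crude stability check of the student's fidelity across independent pseudo-samples; the paper's procedure instead chooses the pseudo-sample size through a CLT-based multiple-testing framework rather than this ad hoc check.

```python
# Distill a black-box teacher into a decision tree on pseudo-data and check
# how fidelity varies across independently drawn pseudo-samples.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
teacher = RandomForestClassifier(random_state=0).fit(X, y)   # black-box teacher

rng = np.random.default_rng(0)
lo, hi = X.min(axis=0), X.max(axis=0)

def distill(n_pseudo):
    """Fit a tree student to teacher labels on uniformly drawn pseudo-data."""
    X_pseudo = rng.uniform(lo, hi, size=(n_pseudo, X.shape[1]))
    student = DecisionTreeClassifier(max_depth=3, random_state=0)
    student.fit(X_pseudo, teacher.predict(X_pseudo))
    return (student.predict(X) == teacher.predict(X)).mean()  # fidelity

for n_pseudo in (100, 1000, 10000):
    fidelities = [distill(n_pseudo) for _ in range(5)]
    print(n_pseudo, "pseudo-samples: fidelity",
          round(np.mean(fidelities), 3), "+/-", round(np.std(fidelities), 3))
```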
Autoreplicative random forests with applications to missing value imputation
Pub Date: 2024-08-01 | DOI: 10.1007/s10994-024-06584-1
Ekaterina Antonenko, Ander Carreño, Jesse Read
Missing values are a common problem in data science and machine learning. Removing instances with missing values is a straightforward workaround, but this can significantly hinder subsequent data analysis, particularly when features outnumber instances. A variety of methodologies have been proposed in the literature for imputing missing values. Denoising Autoencoders, for example, have been leveraged efficiently for imputation. However, neural network approaches have been relatively less effective on smaller datasets. In this work, we propose Autoreplicative Random Forests (ARF) as a multi-output learning approach, introduced in the context of a framework that may impute via either an iterative or procedural process. Experiments on several low- and high-dimensional datasets show that ARF is computationally efficient and exhibits better imputation performance than its competitors, including neural network approaches. To provide statistical analysis and mathematical background for the proposed missing-value imputation framework, we also propose probabilistic ARFs, which provide confidence values over different imputation hypotheses, thereby maximizing the utility of such a framework in a machine-learning pipeline targeting predictive performance.
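As a rough point of reference (not the ARF method itself), the snippet below imputes missing values iteratively with random forests using scikit-learn's IterativeImputer; ARF instead uses a single multi-output, autoreplicative random forest.

```python
# Related baseline: iterative random-forest imputation with scikit-learn.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 8))
X_full[:, 0] = X_full[:, 1] + 0.1 * rng.normal(size=200)   # correlated columns

X_missing = X_full.copy()
mask = rng.random(X_full.shape) < 0.2                      # 20% missing at random
X_missing[mask] = np.nan

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

rmse = np.sqrt(np.mean((X_imputed[mask] - X_full[mask]) ** 2))
print("imputation RMSE on masked entries:", round(rmse, 3))
```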
Describing group evolution in temporal data using multi-faceted events
Pub Date: 2024-08-01 | DOI: 10.1007/s10994-024-06600-4
Andrea Failla, Rémy Cazabet, Giulio Rossetti, Salvatore Citraro
Groups—such as clusters of points or communities of nodes—are fundamental when addressing various data mining tasks. In temporal data, the predominant approach for characterizing group evolution has been through the identification of “events”. However, the events usually described in the literature, e.g., shrinks/growths, splits/merges, are often arbitrarily defined, creating a gap between such theoretical/predefined types and real-data group observations. Moving beyond existing taxonomies, we think of events as “archetypes” characterized by a unique combination of quantitative dimensions that we call “facets”. Group dynamics are defined by their position within the facet space, where archetypal events occupy extremities. Thus, rather than enforcing strict event types, our approach can allow for hybrid descriptions of dynamics involving group proximity to multiple archetypes. We apply our framework to evolving groups from several face-to-face interaction datasets, showing it enables richer, more reliable characterization of group dynamics with respect to state-of-the-art methods, especially when the groups are subject to complex relationships. Our approach also offers intuitive solutions to common tasks related to dynamic group analysis, such as choosing an appropriate aggregation scale, quantifying partition stability, and evaluating event quality.
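As a small illustration of facet-style quantities, the sketch below computes a few membership-based measures between a group observed at two consecutive snapshots; the facet names and formulas are hypothetical, not the paper's definitions.

```python
# Illustrative facet-style quantities for one group across two snapshots.
def group_facets(prev: set, curr: set) -> dict:
    inter = len(prev & curr)
    union = len(prev | curr)
    return {
        "jaccard_overlap": inter / union if union else 1.0,   # membership continuity
        "growth": (len(curr) - len(prev)) / len(prev) if prev else float("inf"),
        "newcomer_fraction": len(curr - prev) / len(curr) if curr else 0.0,
    }

# Example: a group that loses two members and gains three between snapshots.
print(group_facets({"a", "b", "c", "d"}, {"c", "d", "e", "f", "g"}))
```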
Neural calibration of hidden inhomogeneous Markov chains: information decompression in life insurance
Pub Date: 2024-07-31 | DOI: 10.1007/s10994-024-06551-w
Mark Kiermayer, Christian Weiß
Markov chains play a key role in a vast number of areas, including life insurance mathematics. Standard actuarial quantities such as the premium value can be interpreted as compressed, lossy information about the underlying Markov process. We introduce a method to reconstruct the underlying Markov chain given collective information about a portfolio of contracts. Our neural architecture characterizes the process in a highly explainable way by explicitly providing one-step transition probabilities. Further, we provide an intrinsic, economic model validation to inspect the quality of the information decompression. Lastly, our methodology is successfully tested on a realistic data set of German term life insurance contracts.
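The toy example below illustrates the compression direction only: given one-step death probabilities of a simple alive/dead chain, it computes a premium-style expected present value, the kind of aggregate quantity the proposed method decompresses back into transition probabilities. The mortality rates and discount factor are made up for illustration.

```python
# From one-step transition probabilities of an alive/dead Markov chain to a
# compressed actuarial quantity (net single premium of a term life benefit).
import numpy as np

q = np.array([0.002, 0.003, 0.004, 0.006, 0.009])   # hypothetical yearly death probs
v = 1 / 1.03                                         # yearly discount factor
benefit = 100_000.0

alive = 1.0
expected_present_value = 0.0
for t, q_t in enumerate(q):
    expected_present_value += alive * q_t * benefit * v ** (t + 1)  # death in year t+1
    alive *= 1 - q_t                                 # survive to the next year

print("net single premium of the 5-year term cover:", round(expected_present_value, 2))
```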
Integration of multi-modal datasets to estimate human aging
Pub Date: 2024-07-29 | DOI: 10.1007/s10994-024-06588-x
Rogério Ribeiro, Athos Moraes, Marta Moreno, Pedro G. Ferreira
Aging involves complex biological processes leading to the decline of living organisms. As population lifespan increases worldwide, identifying the factors underlying healthy aging has become critical. Integration of multi-modal datasets is a powerful approach for the analysis of complex biological systems, with the potential to uncover novel aging biomarkers. In this study, we leveraged publicly available epigenomic, transcriptomic and telomere length data along with histological images from the Genotype-Tissue Expression project to build tissue-specific regression models for age prediction. Using data from two tissues, lung and ovary, we aimed to compare model performance across data modalities, as well as to assess the improvement resulting from integrating multiple data types. Our results demonstrate that methylation outperformed the other data modalities, with a mean absolute error of 3.36 and 4.36 on the test sets for lung and ovary, respectively. These models achieved lower error rates than established state-of-the-art tissue-agnostic methylation models, emphasizing the importance of a tissue-specific approach. Additionally, this work shows how the application of Hierarchical Image Pyramid Transformers for feature extraction significantly enhances age modeling using histological images. Finally, we evaluated the benefits of integrating multiple data modalities into a single model. Combining methylation data with other data modalities only marginally improved performance, likely due to the limited number of available samples. Combining gene expression with histological features yielded more accurate age predictions than either data type alone. Given these results, this study shows how machine learning applications can be extended to multi-modal aging research. The code used is available at https://github.com/zroger49/multi_modal_age_prediction.
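A minimal sketch of the general setup (not the paper's models or data) is shown below: a penalized linear regression of age on high-dimensional, synthetic stand-ins for methylation features, evaluated by test mean absolute error.

```python
# Regress chronological age on high-dimensional features and report test MAE.
# Synthetic data stands in for tissue-specific methylation matrices.
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n_samples, n_cpgs = 300, 2000
X = rng.normal(size=(n_samples, n_cpgs))             # stand-in methylation features
age = 50 + X[:, :20] @ rng.normal(size=20) + rng.normal(scale=3, size=n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(X, age, random_state=0)
model = ElasticNet(alpha=0.5).fit(X_tr, y_tr)
print("test MAE:", round(mean_absolute_error(y_te, model.predict(X_te)), 2))
```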
Jaccard-constrained dense subgraph discovery
Pub Date: 2024-07-23 | DOI: 10.1007/s10994-024-06595-y
Chamalee Wickrama Arachchi, Nikolaj Tatti
Finding dense subgraphs is a core problem in graph mining with many applications in diverse domains. At the same time, many real-world networks vary over time, that is, the dataset can be represented as a sequence of graph snapshots. Hence, it is natural to consider finding dense subgraphs in a temporal network that are allowed to vary over time to a certain degree. In this paper, we search for dense subgraphs that have large pairwise Jaccard similarity coefficients. More formally, given a set of graph snapshots and an input parameter $\alpha$, we find a collection of dense subgraphs, with pairwise Jaccard index at least $\alpha$, such that the sum of densities of the induced subgraphs is maximized. We prove that this problem is NP-hard and we present a greedy, iterative algorithm which runs in $\mathcal{O}(nk^2 + m)$ time per iteration, where $k$ is the length of the graph sequence and $n$ and $m$ denote the number of vertices and the total number of edges, respectively. We also consider an alternative problem in which subgraphs with large pairwise Jaccard indices are rewarded, by incorporating the indices directly into the objective function. More formally, given a set of graph snapshots and a weight $\lambda$, we find a collection of dense subgraphs such that the sum of densities of the induced subgraphs plus the sum of Jaccard indices, weighted by $\lambda$, is maximized. We prove that this problem is also NP-hard. To discover dense subgraphs with good objective value, we present an iterative algorithm which runs in $\mathcal{O}(n^2 k^2 + m \log n + k^3 n)$ time per iteration, and a greedy algorithm which runs in $\mathcal{O}(n^2 k^2 + m \log n + k^3 n)$ time. We show experimentally that our algorithms are efficient, that they can recover the ground truth in synthetic datasets, and that they provide good results on real-world datasets. Finally, we present two case studies that show the usefulness of our problem.
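The snippet below evaluates the soft-constraint objective described above for a given collection of vertex sets: the sum of induced-subgraph densities plus $\lambda$ times the sum of pairwise Jaccard indices. The toy graphs and candidate sets are illustrative; the paper's iterative and greedy algorithms search for collections maximizing this value.

```python
# Evaluate the weighted objective: sum of induced-subgraph densities plus
# lambda times the sum of pairwise Jaccard indices of the chosen vertex sets.
from itertools import combinations
import networkx as nx

def density(G: nx.Graph, S: set) -> float:
    """Degree density |E(S)| / |S| of the subgraph induced by S."""
    return G.subgraph(S).number_of_edges() / len(S) if S else 0.0

def jaccard(A: set, B: set) -> float:
    return len(A & B) / len(A | B) if A | B else 1.0

def objective(snapshots, subsets, lam: float) -> float:
    dens = sum(density(G, S) for G, S in zip(snapshots, subsets))
    overlap = sum(jaccard(A, B) for A, B in combinations(subsets, 2))
    return dens + lam * overlap

# Two toy snapshots and one candidate collection of vertex sets.
G1 = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])
G2 = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4)])
print(objective([G1, G2], [{0, 1, 2}, {0, 1, 2}], lam=1.0))
```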